2025-03-10

Title: Leveraging Large Language Models For Optimized Item Categorization using UNSPSC Taxonomy

Authors: Anmolika Singh, Yuhang Diao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04728
Pdf URL: https://arxiv.org/pdf/2503.04728
Copy Paste: [[2503.04728]] Leveraging Large Language Models For Optimized Item Categorization using UNSPSC Taxonomy(https://arxiv.org/abs/2503.04728)
Keywords: language model, llm
Abstract: Effective item categorization is vital for businesses, enabling the transformation of unstructured datasets into organized categories that streamline inventory management. Despite its importance, item categorization remains highly subjective and lacks a uniform standard across industries and businesses. The United Nations Standard Products and Services Code (UNSPSC) provides a standardized system for cataloguing inventory, yet employing UNSPSC categorizations often demands significant manual effort. This paper investigates the deployment of Large Language Models (LLMs) to automate the classification of inventory data into UNSPSC codes based on Item Descriptions. We evaluate the accuracy and efficiency of LLMs in categorizing diverse datasets, exploring their language processing capabilities and their potential as a tool for standardizing inventory classification. Our findings reveal that LLMs can substantially diminish the manual labor involved in item categorization while maintaining high accuracy, offering a scalable solution for businesses striving to enhance their inventory management practices.
摘要：有效的项目分类对于企业至关重要，可以使非结构化数据集转换为简化库存管理的有组织类别。尽管重要性很重要，但项目分类仍然高度主观，并且在行业和企业中缺乏统一的标准。联合国标准产品和服务代码（UNDPC）提供了一个标准化的系统来编目库存，但是采用UNSPSC分类通常需要大量的手动努力。本文研究了大型语言模型（LLMS）的部署，以根据项目描述将库存数据分类为UNSPSC代码。我们评估了LLM在分类各种数据集，探索其语言处理能力及其潜力作为标准化库存分类工具的潜力方面的准确性和效率。我们的发现表明，LLM可以大大减少项目分类中涉及的体力劳动，同时保持高精度，为努力增强其库存管理实践的企业提供可扩展的解决方案。
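
For context on the classification target in the abstract above: a UNSPSC code is commonly written as eight digits that read as four two-digit levels (segment, family, class, commodity). The sketch below only splits a code into those hierarchical prefixes. The function name `split_unspsc`, the `UnspscLevels` type, and the sample value are illustrative assumptions, not taken from the paper.

```python
from typing import NamedTuple


class UnspscLevels(NamedTuple):
    """Hierarchical prefixes of an 8-digit UNSPSC code (hypothetical helper type)."""
    segment: str    # first 2 digits
    family: str     # first 4 digits
    class_: str     # first 6 digits
    commodity: str  # all 8 digits


def split_unspsc(code: str) -> UnspscLevels:
    """Split an 8-digit code string into its four hierarchy levels."""
    digits = code.strip()
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected 8 ASCII digits, got {code!r}")
    return UnspscLevels(digits[:2], digits[:4], digits[:6], digits)


# Arbitrary 8-digit example value, used only to show the prefix structure.
levels = split_unspsc("43211503")
print(levels.segment, levels.family, levels.class_, levels.commodity)
```

Because each level is a prefix of the next, a classifier's prediction can be scored separately at segment, family, class, and commodity depth.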

Title: WinClick: GUI Grounding with Multimodal Large Language Models

Authors: Zheng Hui, Yinheng Li, Dan zhao, Tianyi Chen, Colby Banbury, Kazuhito Koishida
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2503.04730
Pdf URL: https://arxiv.org/pdf/2503.04730
Copy Paste: [[2503.04730]] WinClick: GUI Grounding with Multimodal Large Language Models(https://arxiv.org/abs/2503.04730)
Keywords: language model, llm, agent
Abstract: Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing and user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key challenge in developing visual GUI agents: GUI grounding, the ability to accurately locate screen elements based on instructions. However, most existing GUI agents rely on structured data formats such as DOM or HTML files during training or inference, and these are not available in all applications, particularly in general desktop environments such as Windows OS. To address this, we introduce WinClick, a novel visual GUI agent developed on the Windows platform. WinClick leverages screenshots to detect actionable regions. To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training and propose an LLM-based method for aligning GUI grounding data. Additionally, we introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows. Our experiments demonstrate that WinClick, combined with GUI grounding pre-training, significantly outperforms existing baselines, offering a scalable solution for GUI automation in desktop environments. WinSpot is publicly available at this https URL.
摘要：图形用户界面（GUI）任务对于自动化工作流程（例如软件测试，用户界面导航）至关重要。对于用户而言，GUI是与计算机交互的最直观平台。先前的工作确定了开发视觉GUI代理的关键挑战：GUI接地 - 根据说明准确定位屏幕元素的能力。但是，大多数现有的GUI代理都依赖于培训或推论中的DOM或HTML文件（例如DOM或HTML文件），这些数据格式在所有应用程序中都无法访问，尤其是在Windows OS等一般桌面环境中。为了解决这个问题，我们介绍了Winclick，这是Windows平台中开发的新型Visual GUI代理。 Winclick利用屏幕截图来检测可起作用的区域。为了克服GUI接地的挑战，我们通过GUI接地预训练增强了Winclick，并提出了一种基于LLM的方法来对齐GUI接地数据。此外，我们介绍了Winspot，这是在窗户上进行GUI接地的第一个综合基准。我们的实验表明，温克里克（Winclick）与GUI接地预训练相结合，明显优于现有基准，为桌面环境中的GUI自动化提供了可扩展的解决方案。 WinSpot在此HTTPS URL上公开可用。

Title: DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi

Authors: Yansong Ning, Shuowei Cai, Wei Li, Jun Fang, Naiqiang Tan, Hua Chai, Hao Liu
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2503.04768
Pdf URL: https://arxiv.org/pdf/2503.04768
Copy Paste: [[2503.04768]] DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi(https://arxiv.org/abs/2503.04768)
Keywords: llm, agent
Abstract: On-demand ride-hailing services like DiDi, Uber, and Lyft have transformed urban transportation, offering unmatched convenience and flexibility. In this paper, we introduce DiMA, an LLM-powered ride-hailing assistant deployed in DiDi Chuxing. Its goal is to provide seamless ride-hailing services and beyond through a natural and efficient conversational interface under dynamic and complex spatiotemporal urban contexts. To achieve this, we propose a spatiotemporal-aware order planning module that leverages external tools for precise spatiotemporal reasoning and progressive order planning. Additionally, we develop a cost-effective dialogue system that integrates multi-type dialog repliers with cost-aware LLM configurations to handle diverse conversation goals and to trade off response quality against latency. Furthermore, we introduce a continual fine-tuning scheme that utilizes real-world interactions and simulated dialogues to align the assistant's behavior with human-preferred decision-making processes. Since its deployment in the DiDi application, DiMA has demonstrated exceptional performance, achieving 93% accuracy in order planning and 92% in response generation during real-world interactions. Offline experiments further validate DiMA's capabilities, showing improvements of up to 70.23% in order planning and 321.27% in response generation compared to three state-of-the-art agent frameworks, while reducing latency by $0.72\times$ to $5.47\times$. These results establish DiMA as an effective, efficient, and intelligent mobile assistant for ride-hailing services.
摘要：Didi，Uber和Lyft等按需乘车服务已改变了城市交通，提供了无与伦比的便利性和灵活性。在本文中，我们介绍了DiDi Chuxing部署的LLM驱车助手Dima。它的目标是在动态和复杂的时空城市环境下通过自然而有效的对话界面提供无缝的乘车服务。为了实现这一目标，我们提出了一个时空感知的订单计划模块，该模块利用外部工具来精确时空推理和渐进式订单计划。此外，我们开发了一种具有成本效益的对话系统，该系统将多类型对话框重新架集成到成本吸引的LLM配置，以处理多种对话目标以及权衡响应质量和延迟。此外，我们引入了一个持续的微调计划，该方案利用现实世界的交互和模拟对话将助手的行为与人类首选决策过程保持一致。由于DIMA在DIDI应用程序中的部署中的部署表现出了出色的性能，在订单计划中达到了93％的准确性，在现实世界中的互动过程中获得了92％的响应生成。离线实验进一步验证了DIMA功能，与三个最先进的代理框架相比，在订单计划中的提高高达70.23％，响应产生321.27％的提高，同时将延迟降低为$ 0.72 \ $ $ 5.47 \ $ 5.47 \ times $。这些结果将DIMA确定为乘车服务的有效，高效且聪明的移动助手。

Title: Invisible Walls in Cities: Leveraging Large Language Models to Predict Urban Segregation Experience with Social Media Content

Authors: Bingbing Fan, Lin Chen, Songwei Li, Jian Yuan, Fengli Xu, Pan Hui, Yong Li
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2503.04773
Pdf URL: https://arxiv.org/pdf/2503.04773
Copy Paste: [[2503.04773]] Invisible Walls in Cities: Leveraging Large Language Models to Predict Urban Segregation Experience with Social Media Content(https://arxiv.org/abs/2503.04773)
Keywords: language model, llm
Abstract: Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose using Large Language Models (LLMs) to automate online review mining for segregation prediction. We design a Reflective LLM Coder to digest social media content into insights consistent with real-world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning-and-EMbedding (RE'EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi-channel features for segregation prediction. Experiments on real-world data demonstrate that our framework greatly improves prediction accuracy, with a 22.79% increase in $R^2$ and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction accuracy. Moreover, our user study confirms that the codebook-guided summaries provide cognitive gains for human participants in perceiving POIs' social inclusiveness. This study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with AI.
摘要：了解城市日常生活中经验丰富的种族隔离对于解决社会不平等和培养包容性至关重要。社交媒体上的大量用户生成的评论囊括了与不同地方相关的细微看法和感受，从而提供了丰富的隔离见解。但是，利用这些数据构成了巨大的挑战，由于其众多，模棱两可和各种观点的融合。为了应对这些挑战，我们建议使用大型语言模型（LLM）自动化在线评论挖掘以进行隔离预测。我们设计了一个反思性的LLM编码器，以将社交媒体内容消化为与现实世界反馈一致的见解，并最终生成一个代码手册，捕获了信号隔离经验的关键维度，例如文化共鸣和吸引力，可及性，可访问性和便利性，社区参与和社区参与和本地参与。在代码簿的指导下，LLM可以同时生成内容丰富的审核摘要和隔离预测的评分。此外，我们设计了一个推理和插入（RE'EM）框架，该框架结合了语言模型的推理和嵌入功能，以集成多通道特征以进行隔离预测。现实世界中数据的实验表明，我们的框架大大提高了预测准确性，R2的高度为22.79％，MSE降低了9.33％。派生的代码手册可以在三个不同的城市之间推广，从而不断改善该HTTP URL的预测，我们的用户研究证实，代码手册引入的摘要为人类参与者提供了认知的增长，以感知POIS'社交'这项HTTP URL研究标志着在理解隐式社会障碍和中等范围内的社交潜力，表现出具有社交性的重要性，以促进社会的社会性能，从而促进了社交的潜力。

Title: MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model

Authors: Sumin Ha, Jun Hyeong Kim, Yinhua Piao, Sun Kim
Subjects: cs.CL, cs.AI, physics.atom-ph
Abstract URL: https://arxiv.org/abs/2503.04780
Pdf URL: https://arxiv.org/pdf/2503.04780
Copy Paste: [[2503.04780]] MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model(https://arxiv.org/abs/2503.04780)
Keywords: language model, llm
Abstract: Human expertise in chemistry and biomedicine relies on contextual molecular understanding, a capability that large language models (LLMs) can extend through fine-grained alignment between molecular structures and text. Recent multimodal learning advances focus on cross-modal alignment, but existing molecule-text models ignore complementary information in different molecular views and rely on single-view representations, limiting molecular understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) separate aligned spaces with inconsistent mappings between molecule and text embeddings, and (2) existing loss objectives that fail to preserve complementary information for fine-grained alignment. These issues can limit the LLM's ability to fully understand molecular properties. To address them, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer (MQ-Former). Our approach ensures cross-view consistency while a token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available in this https URL.
摘要：人类化学和生物医学方面的专业知识依赖于上下文分子的理解，大语言模型（LLMS）可以通过分子结构和文本之间的细粒度对齐来扩展的能力。最近的多模式学习的进步集中在跨模式对齐方式上，但是现有的分子文本模型忽略了不同分子视图中的互补信息，而依靠单视图，从而限制了分子理解。此外，幼稚的多视图对齐策略面临两个挑战：（1）分子和文本嵌入之间的映射不一致的单独的对齐空间，并且（2）现有的损失目标无法保留互补信息以获得细粒度对齐的互补信息。这可以限制LLM充分理解分子特性的能力。为了解决这些问题，我们提出了MV-CLAM，这是一个新型框架，使用多Query Transformer（MQ-Former）将多视图分子表示与统一的文本空间保持一致。我们的方法可确保跨视图的一致性，而令牌级的对比损失可以保留在文本查询中的各种分子特征。 MV-clam增强了分子推理，提高了检索和字幕的准确性。 MV-Clam的源代码可在此HTTPS URL中获得。

Title: Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects

Authors: Anichur Rahman, Shahariar Hossain Mahir, Md Tanjum An Tashrif, Airin Afroj Aishi, Md Ahsan Karim, Dipanjali Kundu, Tanoy Debnath, Md. Abul Ala Moududi, MD. Zunead Abedin Eidmum
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2503.04783
Pdf URL: https://arxiv.org/pdf/2503.04783
Copy Paste: [[2503.04783]] Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects(https://arxiv.org/abs/2503.04783)
Keywords: language model, gpt, llm, chat
Abstract: Nowadays, DeepSeek, ChatGPT, and Google Gemini are the most trending and exciting Large Language Model (LLM) technologies for reasoning, multimodal capabilities, and general linguistic performance worldwide. DeepSeek employs a Mixture-of-Experts (MoE) approach, activating only the parameters most relevant to the task at hand, which makes it especially effective for domain-specific work. ChatGPT, on the other hand, relies on a dense transformer model enhanced through reinforcement learning from human feedback (RLHF), while Google Gemini uses a multimodal transformer architecture that integrates text, code, and images into a single framework. With these technologies, people can mine their desired text, code, images, etc., through cost-effective and domain-specific inference, and may choose among them based on which performs best. In this regard, we offer a comparative study of DeepSeek, ChatGPT, and Gemini in this research. Initially, we focus on their methods and materials, including the data selection criteria. Then, we present state-of-the-art features of DeepSeek, ChatGPT, and Gemini based on their applications. Most importantly, we show the technological comparison among them and also cover the dataset analysis for various applications. Finally, we address extensive research areas and future potential guidance regarding LLM-based AI research for the community.
摘要：如今，DeepSeek，Chatgpt和Google Gemini是推理，多模式能力和全球一般语言性能的最流行和令人兴奋的大型语言模型（LLM）技术。 DeepSeek采用了专家的混合物（MOE）方法，仅激活与手头任务最相关的参数，这使其对特定于域的工作特别有效。另一方面，Chatgpt依赖于通过从人类反馈（RLHF）学习增强的密集变压器模型，然后Google Gemini实际上使用了将文本，代码和图像集成到单个框架中的多模式变压器体系结构。但是，通过使用这些技术，人们可以通过具有成本效益和特定领域的推断来挖掘其所需的文本，代码，图像等。人们可以根据最佳性能选择这些技术。在这方面，我们在这项研究中提供了基于DeepSeek，Chatgpt和Gemini技术的比较研究。最初，我们专注于他们的方法和材料，适当地包括数据选择标准。然后，我们根据其应用程序介绍DeepSeek，Chatgpt和Gemini的最先进功能。最重要的是，我们显示了它们之间的技术比较，还涵盖了各种应用程序的数据集分析。最后，我们针对基于LLM的AI研究的广泛研究领域和未来的潜在指导。
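
The abstract above describes Mixture-of-Experts as activating only the parameters most relevant to the input. A minimal sketch of generic top-k expert routing follows. It is not DeepSeek's actual router: the function names, the scalar "experts", and the hand-written gate logits are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: float,
    gate_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> Tuple[float, List[int]]:
    """Run only the k experts with the highest gate logits and mix their outputs."""
    # Pick the indices of the k highest-scoring experts; the rest stay inactive.
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalise the gate scores over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top


# Hypothetical experts: simple scalar functions standing in for sub-networks.
experts = [lambda v: v, lambda v: 2 * v, lambda v: 100 * v]
out, active = moe_forward(3.0, gate_logits=[0.0, 0.0, -5.0], experts=experts, k=2)
print(out, active)
```

In a real model the gate logits come from a learned router applied to each token, and the experts are feed-forward sub-networks; the selection-then-mix pattern is the part this sketch illustrates.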

Title: KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework

Authors: Jiexiong Liu, Yixuan Chen, Yanqin Jia, Zhepeng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04784
Pdf URL: https://arxiv.org/pdf/2503.04784
Copy Paste: [[2503.04784]] KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework(https://arxiv.org/abs/2503.04784)
Keywords: language model, llm
Abstract: Large language models have demonstrated remarkable performance across various tasks, yet they face challenges such as low computational efficiency, gradient vanishing, and difficulties in capturing complex feature interactions. To address these limitations, a novel framework has been proposed. This framework incorporates a learnable dense residual skip connection mechanism, a TransformerX module (a transformer-based component integrating multiscale convolution and adaptive activation functions), and a multitoken prediction interaction module. The learnable dense residual connections enhance information flow and feature capture across layers. Within the TransformerX module, large convolutional kernels aggregate semantic information from extensive text segments, while smaller convolutions focus on local word order and syntactic structures. The adaptive activation function dynamically adjusts its parameters based on the semantic features of the input text, improving the model's ability to handle diverse semantic expressions and complex relationships. The multitoken prediction module boosts data utilization and accelerates inference by predicting multiple future tokens. These components significantly enhance the performance and efficiency of large language models.
摘要：大型语言模型已经在各种任务中表现出了出色的表现，但是它们面临着诸如计算效率低，消失和捕获复杂功能相互作用的困难等挑战。为了解决这些局限性，已经提出了一个新颖的框架。该框架结合了一个可学习的密集剩余跳过连接机制，Transfererx模块基于变压器的组件集成了多尺度卷积和自适应激活功能以及多语的预测相互作用模块。可学习的密集残差连接可以增强信息流和跨层的特征捕获。在Transformerx模块中，大量的卷积内核从广泛的文本段聚集了语义信息，而较小的卷积则集中在本地单词顺序和句法结构上。自适应激活函数根据输入文本的语义特征动态调整其参数，从而提高了模型处理各种语义表达式和复杂关系的能力。多语预测模块通过预测多个将来的令牌来促进数据利用和加速推理。这些组件大大提高了大语言模型的性能和效率。
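
The abstract above contrasts large kernels, which aggregate information over long spans, with small kernels, which focus on local structure. The sketch below illustrates that multi-scale idea on a one-dimensional sequence. It is not the TransformerX module: fixed averaging kernels stand in for learned ones, and the function names and the kernel sizes 3 and 7 are assumptions.

```python
from typing import List, Sequence


def conv1d_same(xs: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D convolution with zero padding so the output has the input's length.

    Assumes an odd kernel length so the window is centred on each position.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(xs))
    ]


def multi_scale_features(
    xs: Sequence[float], kernel_sizes: Sequence[int] = (3, 7)
) -> List[List[float]]:
    """Return one feature vector per position, with one entry per kernel size."""
    per_scale = [conv1d_same(xs, [1.0 / k] * k) for k in kernel_sizes]
    # Transpose from [scale][position] to [position][scale].
    return [list(column) for column in zip(*per_scale)]


features = multi_scale_features([1.0, 1.0, 1.0, 1.0, 1.0])
print(features[2])  # small-kernel and large-kernel views of the middle position
```

The small kernel sees only immediate neighbours, while the large kernel's window reaches the zero padding at both ends, which is why the two entries for the same position differ.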

Title: Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice

Authors: José Siqueira de Cerqueira, Kai-Kristian Kemell, Rebekah Rousi, Nannan Xi, Juho Hamari, Pekka Abrahamsson
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.04785
Pdf URL: https://arxiv.org/pdf/2503.04785
Copy Paste: [[2503.04785]] Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice(https://arxiv.org/abs/2503.04785)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The rapid proliferation of Large Language Models (LLMs) has raised pressing concerns regarding their trustworthiness, spanning issues of reliability, transparency, fairness, and ethical alignment. Despite the increasing adoption of LLMs across various domains, there remains a lack of consensus on how to operationalize trustworthiness in practice. This study bridges the gap between theoretical discussions and implementation by conducting a bibliometric mapping analysis of 2,006 publications from 2019 to 2025. Through co-authorship networks, keyword co-occurrence analysis, and thematic evolution tracking, we identify key research trends, influential authors, and prevailing definitions of LLM trustworthiness. Additionally, a systematic review of 68 core papers is conducted to examine conceptualizations of trust and their practical implications. Our findings reveal that trustworthiness in LLMs is often framed through existing organizational trust frameworks, emphasizing dimensions such as ability, benevolence, and integrity. However, a significant gap exists in translating these principles into concrete development strategies. To address this, we propose a structured mapping of 20 trust-enhancing techniques across the LLM lifecycle, including retrieval-augmented generation (RAG), explainability techniques, and post-training audits. By synthesizing bibliometric insights with practical strategies, this study contributes towards fostering more transparent, accountable, and ethically aligned LLMs, ensuring their responsible deployment in real-world applications.
摘要：大型语言模型（LLM）的快速扩散引起了人们对他们的可信赖性，跨越可靠性，透明度，公平和道德一致性问题的紧迫关注。尽管LLM在各个领域的采用越来越多，但仍缺乏如何在实践中运作信任度达成共识。这项研究通过对2019年至2025年的2,006个出版物进行书目映射分析来弥合理论讨论与实施之间的差距。通过共同授权网络，关键字共同发生分析和主题演化跟踪，我们确定了关键的研究趋势，有影响力的作者，有影响力的作者，以及对LLM Trust的定义。此外，还对68篇核心论文进行了系统评价，以检查信任的概念及其实际含义。我们的发现表明，LLMS中的可信度通常是通过现有的组织信任框架构成的，强调了能力，仁慈和正直等维度。但是，将这些原则转化为具体的发展策略存在很大的差距。为了解决这个问题，我们提出了整个LLM生命周期中20种信任增强技术的结构化映射，包括检索成绩（RAG），解释性技术和培训后审核。通过将文献计量学的见解与实用策略合成，这项研究有助于促进更透明，负责和道德上一致的LLM，从而确保其在现实世界中的负责任部署。

Title: Towards Anthropomorphic Conversational AI Part I: A Practical Framework

Authors: Fei Wei, Yaliang Li, Bolin Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04787
Pdf URL: https://arxiv.org/pdf/2503.04787
Copy Paste: [[2503.04787]] Towards Anthropomorphic Conversational AI Part I: A Practical Framework(https://arxiv.org/abs/2503.04787)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs), due to their advanced natural language capabilities, have seen significant success in applications where the user interface is usually a conversational artificial intelligence (AI) agent that engages the user through multi-round conversations. However, many scenarios require the agents to exhibit stronger social and conversational intelligence and demonstrate more human-like (anthropomorphic) reactions. Foundational LLMs have yet to fully address this aspect, so a single call to a foundational model might be insufficient. To bridge this gap, we propose a two-stage solution. In this work, we focus on the first stage, introducing a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations. This framework comprises thinking modules for reasoning, resource modules for managing knowledge and external information, and response modules for generating contextually appropriate interactions. With all the modules cooperating, the framework would empower the agents to provide a better human-like conversation experience. In the second stage of our approach, the resulting conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning, enabling AI to better capture human preferences. This stage is left for future work. In our experiments, volunteers engaged in over 3000 rounds of conversation with the same AI character, powered either by a standalone LLM or by our framework, which integrates the same LLM. A separate group of evaluators rated the conversation samples, revealing that our framework significantly enhanced social and conversational intelligence, even without fine-tuning the LLM.
摘要：大型语言模型（LLMS）由于其先进的自然语言能力，在用户界面通常是对话性人工智能（AI）代理的应用程序中取得了重大成功，并通过多轮对话使用户互动。但是，许多场景要求代理商表现出更强的社会和对话智力，并表现出更类似人类的（拟人化）反应。这是基础LLM尚未完全解决的一个方面，因此基础模型的单个呼叫可能不足。为了弥合这一间隙，我们提出了一个两阶段的解决方案。在这项工作中，我们专注于第一阶段，引入了一个多模块框架，旨在复制对话中涉及的人类智能的关键方面。该框架包括用于推理的思维模块，用于管理知识和外部信息的资源模块以及用于生成上下文适当交互的响应模块。随着所有模块的合作，该框架将使代理商有能力提供更好的人类般的对话体验。在我们方法的第二阶段中，这些对话数据在过滤和标记后可以作为增强学习的培训和测试数据，从而使AI更好地捕获人类的偏好。这个阶段留给以后的工作。在我们的实验中，志愿者与由独立LLM供电的相同AI角色进行了3000多轮对话，我们的框架集成了相同的LLM。一组评估者对对话样本进行了评估，表明我们的框架显着增强了社交和对话智力，即使没有微调LLM。

Title: AgroLLM: Connecting Farmers and Agricultural Practices through Large Language Models for Enhanced Knowledge Transfer and Practical Application

Authors: Dinesh Jackson Samuel, Inna Skarga-Bandurova, David Sikolia, Muhammad Awais
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04788
Pdf URL: https://arxiv.org/pdf/2503.04788
Copy Paste: [[2503.04788]] AgroLLM: Connecting Farmers and Agricultural Practices through Large Language Models for Enhanced Knowledge Transfer and Practical Application(https://arxiv.org/abs/2503.04788)
Keywords: language model, gpt, llm, chat, retrieval-augmented generation
Abstract: AgroLLM is an AI-powered chatbot designed to enhance knowledge-sharing and education in agriculture using Large Language Models (LLMs) and a Retrieval-Augmented Generation (RAG) framework. By using a comprehensive open-source agricultural database, AgroLLM provides accurate, contextually relevant responses while reducing incorrect information retrieval. The system utilizes the FAISS vector database for efficient similarity searches, ensuring rapid access to agricultural knowledge. A comparative study of three advanced models (Gemini 1.5 Flash, ChatGPT-4o Mini, and Mistral-7B-Instruct-v0.2) was conducted to evaluate performance across four key agricultural domains: Agriculture and Life Sciences, Agricultural Management, Agriculture and Forestry, and Agriculture Business. Key evaluation metrics included embedding quality, search efficiency, and response relevance. Results indicated that ChatGPT-4o Mini with RAG achieved the highest accuracy at 93%. Continuous feedback mechanisms enhance response quality, making AgroLLM a benchmark AI-driven educational tool for farmers, researchers, and professionals, promoting informed decision-making and improved agricultural practices.
摘要：Agrollm是一种AI驱动的聊天机器人，旨在使用大语言模型（LLM）和检索功能增强的一代（RAG）框架来增强农业知识共享和教育。通过使用全面的开源农业数据库，Agrollm提供了准确的，上下文相关的响应，同时减少了错误的信息检索。该系统利用FAISS矢量数据库进行有效的相似性搜索，从而确保快速获取农业知识。对三种高级模型的比较研究：Gemini 1.5 Flash，Chatgpt-4O Mini和Mistral-7b-Instruct-V0.2进行了评估，以评估四个关键农业领域的绩效：农业和生活科学，农业管理，农业管理，农业和林业和农业商业。关键评估指标包括嵌入质量，搜索效率和响应相关性。结果表明，带有RAG的Chatgpt-4O Mini的准确性最高，为93％。连续反馈机制提高了响应质量，使Agrollm成为农民，研究人员和专业人士的基准AI驱动的教育工具，促进了知情的决策和改善的农业实践。
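
The retrieval step mentioned in the abstract above (similarity search over embedded documents) can be illustrated without any vector library. The sketch below does not use FAISS: a brute-force cosine-similarity scan in plain Python stands in for it, and the document ids, the two-dimensional "embeddings", and the function names are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], corpus: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k corpus vectors most similar to the query."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)
    return ranked[:k]


# Toy embeddings; a real system would obtain these from an embedding model.
corpus = {
    "soil": [1.0, 0.0],
    "pests": [0.0, 1.0],
    "irrigation": [0.9, 0.1],
}
print(top_k([1.0, 0.0], corpus, k=2))
```

In a RAG pipeline the returned documents are then placed in the model's prompt; an indexed vector store replaces the linear scan once the corpus is large.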

Title: Ext2Gen: Alignment through Unified Extraction and Generation for Robust Retrieval-Augmented Generation

Authors: Hwanjun Song, Jeonghwan Choi, Minseok Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04789
Pdf URL: https://arxiv.org/pdf/2503.04789
Copy Paste: [[2503.04789]] Ext2Gen: Alignment through Unified Extraction and Generation for Robust Retrieval-Augmented Generation(https://arxiv.org/abs/2503.04789)
Keywords: llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances LLMs by integrating external knowledge, but generation remains fragile due to the uncertain placement of relevant chunks and retrieval-induced information overload, leading to hallucinations. We propose Ext2Gen, a novel extract-then-generate model that enhances RAG robustness by first extracting query-relevant sentences before generating answers. To optimize this model, we employ preference alignment through pairwise feedback learning, enabling the model to generate robust answers regardless of variations in retrieval results. Extensive experiments demonstrate that Ext2Gen effectively identifies query-relevant sentences with high precision and recall, leading to highly reliable answers. Furthermore, deploying our model in a RAG environment reveals that it not only boosts the performance of the base LLM but also synergizes with advanced retrieval strategies like query expansion. The dataset and model will be released soon.
摘要：检索增强的生成（RAG）通过整合外部知识来增强LLM，但是由于不确定相关块和检索引起的信息过载的位置不确定，产生仍然脆弱，从而导致幻觉。我们提出了一种新颖的提取物，然后是生成的模型，该模型通过先提取与查询相关的句子在生成答案之前，从而增强抹布的鲁棒性。为了优化该模型，我们通过成对反馈学习采用偏好对齐方式，使该模型能够生成可靠的答案，而不管检索结果的变化如何。广泛的实验表明，Ext2Gen有效地以高精度和回忆为单位确定了与查询相关的句子，从而提供了高度可靠的答案。此外，将我们的模型部署在抹布环境中表明，它不仅可以提高基本LLM的性能，而且还可以与高级检索策略（如查询扩展）协同作用。数据集和模型将很快发布。

Title: Cross-linguistic disagreement as a conflict of semantic alignment norms in multilingual AI~Linguistic Diversity as a Problem for Philosophy, Cognitive Science, and AI~

Authors: Masaharu Mizumoto, Dat Tien Nguyen, Justin Sytsma, Mark Alfano, Yu Izumi, Koji Fujita, Nguyen Le Minh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04792
Pdf URL: https://arxiv.org/pdf/2503.04792
Copy Paste: [[2503.04792]] Cross-linguistic disagreement as a conflict of semantic alignment norms in multilingual AI~Linguistic Diversity as a Problem for Philosophy, Cognitive Science, and AI~(https://arxiv.org/abs/2503.04792)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) face an often-overlooked challenge stemming from intrinsic semantic differences across languages. Linguistic divergence can sometimes lead to cross-linguistic disagreements--disagreements purely due to semantic differences about a relevant concept. This paper identifies such disagreements as conflicts between two fundamental alignment norms in multilingual LLMs: cross-linguistic consistency (CL-consistency), which seeks universal concepts across languages, and consistency with folk judgments (Folk-consistency), which respects language-specific semantic norms. Through examining responses of conversational multilingual AIs in English and Japanese with the cases used in philosophy (cases of knowledge-how attributions), this study demonstrates that even state-of-the-art LLMs provide divergent and internally inconsistent responses. Such findings reveal a novel qualitative limitation in crosslingual knowledge transfer, or conceptual crosslingual knowledge barriers, challenging the assumption that universal representations and cross-linguistic transfer capabilities are inherently desirable. Moreover, they reveal conflicts of alignment policies of their developers, highlighting critical normative questions for LLM researchers and developers. The implications extend beyond technical alignment challenges, raising normative, moral-political, and metaphysical questions about the ideals underlying AI development--questions that are shared with philosophers and cognitive scientists but for which no one yet has definitive answers, inviting a multidisciplinary approach to balance the practical benefits of cross-linguistic consistency and respect for linguistic diversity.
摘要：多语言大型语言模型（LLMS）面临着一个经常被忽视的挑战，源于语言之间的内在语义差异。语言差异有时会导致跨语言分歧 - 纯粹是由于有关相关概念的语义差异。本文将这种分歧确定为多语言LLM中的两个基本对齐规范之间的冲突：跨语言一致性（CL矛盾），该规范跨语言寻求普遍的概念，并与民间判断（民间矛盾）一致，尊重语言特异性的语义规范。通过研究哲学中使用的案例（知识归因的案例），通过检查对话式多语言AIS的反应，这项研究表明，即使是最新的LLMS也提供了不同的内部和内部不一致的响应。这样的发现揭示了跨语言知识转移或概念性跨语言知识障碍的新型定性限制，这挑战了普遍表示和跨语言转移能力的假设本质上是可取的。此外，他们揭示了开发人员的一致性政策冲突，突出了LLM研究人员和开发人员的关键规范性问题。这些含义超出了技术一致性的挑战，提高了有关AI基础发展的理想的规范性，道德政治和形而上学的问题 - 与哲学家和认知科学家共享的问题，但尚无人权答案，邀请多学科的方法，邀请一种多学科的方法来平衡跨语言一致性和尊重语言的交叉效果。

Title: Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference

Authors: Wenjie Qiu, Yi-Chen Li, Xuqin Zhang, Tianyi Zhang, Yihang Zhang, Zongzhang Zhang, Yang Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04793
Pdf URL: https://arxiv.org/pdf/2503.04793
Copy Paste: [[2503.04793]] Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference(https://arxiv.org/abs/2503.04793)
Keywords: language model, llm
Abstract: Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
摘要：从人类偏好数据集中学习奖励模型，并随后通过强化学习来优化语言模型，这已成为将LLM与人类偏好保持一致的基本范式。奖励模型的性能在对齐的有效性中起着至关重要的作用。以前的奖励模型以粗粒度的水平运行，需要产生完整的响应才能获得奖励价值。稀疏的奖励可能会给下游增强学习带来挑战。尽管最近的努力试图学习令牌级的奖励模型，但缺乏明确的语义信息使得很难对每个单个代币的信用进行建模。在本文中，我们建议为每个句子分配分数，并引入中间粒度的奖励模型。通过将完整的响应分段到句子中，并应用差异操作以奖励每个句子的开始和结束位置的输出，我们可以有效地对句子的回报进行建模。此外，引入了一种新颖的注意机制，以将所有句子的分数汇总为响应级别的分数，从而可以使用Bradley-Terry模型对其进行培训。在通用基准上，我们的方法在奖励基地（用于奖励建模评估）上优于响应级奖励模型2.7％，并超过了Alpacaeval上的所有基准（用于对齐评估）。
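
The three steps named in the abstract above (differencing the reward output at sentence boundaries, aggregating sentence scores with attention weights, and training with the Bradley-Terry model) can be sketched numerically. This is an illustration under assumptions, not the authors' implementation: the per-token outputs, sentence spans, and attention logits are hand-written numbers, and a plain softmax stands in for the paper's attention mechanism.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_rewards(token_outputs: Sequence[float], spans: Sequence[Tuple[int, int]]) -> List[float]:
    """Score each sentence as output[end] - output[start] over its token span."""
    return [token_outputs[end] - token_outputs[start] for start, end in spans]


def aggregate(scores: Sequence[float], attn_logits: Sequence[float]) -> float:
    """Combine sentence scores into one response-level score with softmax weights."""
    weights = softmax(attn_logits)
    return sum(w * s for w, s in zip(weights, scores))


def bradley_terry_loss(chosen: float, rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    return math.log1p(math.exp(-(chosen - rejected)))


# Hypothetical reward-head outputs for a 5-token response made of two sentences.
outputs = [0.0, 0.5, 1.0, 0.8, 2.0]
spans = [(0, 2), (3, 4)]  # inclusive (start, end) token indices per sentence
scores = sentence_rewards(outputs, spans)
response_score = aggregate(scores, attn_logits=[0.0, 0.0])
print(scores, response_score, bradley_terry_loss(response_score, 0.0))
```

Because the aggregated score is a single number per response, pairwise preference data can supervise it directly, while the per-sentence scores remain available as denser feedback.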

Title: Cyber for AI at SemEval-2025 Task 4: Forgotten but Not Lost: The Balancing Act of Selective Unlearning in Large Language Models

Authors: Dinesh Srivasthav P, Bala Mallikarjunarao Garlapati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04795
Pdf URL: https://arxiv.org/pdf/2503.04795
Copy Paste: [[2503.04795]] Cyber for AI at SemEval-2025 Task 4: Forgotten but Not Lost: The Balancing Act of Selective Unlearning in Large Language Models(https://arxiv.org/abs/2503.04795)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) face significant challenges in maintaining privacy, ethics, and compliance when sensitive or obsolete data must be selectively removed. Retraining these models from scratch is computationally infeasible, necessitating efficient alternatives. As part of SemEval-2025 Task 4, this work focuses on the application of selective unlearning in LLMs to address this challenge. In this paper, we present our experiments and findings, primarily leveraging global weight modification to achieve an equilibrium between the effectiveness of unlearning, knowledge retention, and the target model's post-unlearning utility. We also detail the task-specific evaluation mechanism, results, and challenges. Our algorithms have achieved an aggregate score of 0.409 and 0.389 on the test set for 7B and 1B target models, respectively, demonstrating promising results in verifiable LLM unlearning.

Title: Optimizing Multi-Hop Document Retrieval Through Intermediate Representations

Authors: Jiaen Lin, Jingyu Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.04796
Pdf URL: https://arxiv.org/pdf/2503.04796
Copy Paste: [[2503.04796]] Optimizing Multi-Hop Document Retrieval Through Intermediate Representations(https://arxiv.org/abs/2503.04796)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) encounters challenges when addressing complex queries, particularly multi-hop questions. While several methods tackle multi-hop queries by iteratively generating internal queries and retrieving external documents, these approaches are computationally expensive. In this paper, we identify a three-stage information processing pattern in LLMs during layer-by-layer reasoning, consisting of extraction, processing, and subsequent extraction steps. This observation suggests that the representations in intermediate layers contain richer information compared to those in other layers. Building on this insight, we propose Layer-wise RAG (L-RAG). Unlike prior methods that focus on generating new internal queries, L-RAG leverages intermediate representations from the middle layers, which capture next-hop information, to retrieve external knowledge. L-RAG achieves performance comparable to multi-step approaches while maintaining inference overhead similar to that of standard RAG. Experimental results show that L-RAG outperforms existing RAG methods on open-domain multi-hop question-answering datasets, including MuSiQue, HotpotQA, and 2WikiMultiHopQA. The code is available in this https URL

Title: HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

Authors: Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04800
Pdf URL: https://arxiv.org/pdf/2503.04800
Copy Paste: [[2503.04800]] HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation(https://arxiv.org/abs/2503.04800)
Keywords: language model, llm, retrieval-augmented generation
Abstract: While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures temporal knowledge evolution in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG.
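As a rough illustration of what a token-level diff between two versions of a fact looks like, the sketch below uses Python's standard `difflib.SequenceMatcher` on whitespace-split tokens. The tokenization, the function name `token_diff`, and the sample sentences are assumptions made for this example; the benchmark's actual diff algorithm and LLM pipeline are not specified in the abstract.

```python
import difflib


def token_diff(old_text, new_text):
    """Return the non-equal edit operations between two texts, token by token."""
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the operation plus the outdated and the updated token spans.
            changes.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return changes


# Hypothetical snapshots of the same sentence at two points in time.
outdated = "The tower is 300 meters tall"
current = "The tower is 330 meters tall"
changes = token_diff(outdated, current)
```

Each returned tuple pairs an outdated span with its replacement, which is the kind of raw signal that a later step could turn into a question with one current answer and one outdated answer.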

Title: Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models

Authors: Boyu Jia, Junzhe Zhang, Huixuan Zhang, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04801
Pdf URL: https://arxiv.org/pdf/2503.04801
Copy Paste: [[2503.04801]] Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models(https://arxiv.org/abs/2503.04801)
Keywords: language model, llm
Abstract: In recent years, multimodal large language models (MLLMs) have achieved significant breakthroughs, enhancing understanding across text and vision. However, current MLLMs still face challenges in effectively integrating knowledge across these modalities during multimodal knowledge reasoning, leading to inconsistencies in reasoning outcomes. To systematically explore this issue, we propose four evaluation tasks and construct a new dataset. We conduct a series of experiments on this dataset to analyze and compare the extent of consistency degradation in multimodal knowledge reasoning within MLLMs. Based on the experimental results, we identify factors contributing to the observed degradation in consistency. Our research provides new insights into the challenges of multimodal knowledge reasoning and offers valuable guidance for future efforts aimed at improving MLLMs.

Title: Call for Rigor in Reporting Quality of Instruction Tuning Data

Authors: Hyeonseok Moon, Jaehyung Seo, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04807
Pdf URL: https://arxiv.org/pdf/2503.04807
Copy Paste: [[2503.04807]] Call for Rigor in Reporting Quality of Instruction Tuning Data(https://arxiv.org/abs/2503.04807)
Keywords: language model, llm
Abstract: Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained with that data. However, we identified a prevalent issue in such practice: hyperparameters for training models are often selected arbitrarily without adequate justification. We observed significant variations in hyperparameters applied across different studies, even when training the same model with the same data. In this study, we demonstrate the potential problems arising from this practice and emphasize the need for careful consideration in verifying data quality. Through our experiments on the quality of LIMA data and a selected set of 1,000 Alpaca data points, we demonstrate that arbitrary hyperparameter decisions can lead to arbitrary conclusions.

Title: Learning from Failures in Multi-Attempt Reinforcement Learning

Authors: Stephen Chung, Wenyu Du, Jie Fu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04808
Pdf URL: https://arxiv.org/pdf/2503.04808
Copy Paste: [[2503.04808]] Learning from Failures in Multi-Attempt Reinforcement Learning(https://arxiv.org/abs/2503.04808)
Keywords: language model, llm
Abstract: Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at this https URL
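The multi-attempt setting described above can be pictured as a small episode loop: the model answers, receives feedback if it is wrong, and tries again until it succeeds or runs out of attempts. The sketch below is an assumption-laden illustration, not the authors' code: `policy` is a hypothetical callable standing in for the LLM, answers are checked by exact string match, and the feedback wording is invented for the example.

```python
def multi_attempt_episode(policy, question, reference, max_attempts):
    """Run one episode; return (solved, attempts_used, transcript)."""
    transcript = []  # (answer, feedback) pairs that are shown back to the policy
    for attempt in range(1, max_attempts + 1):
        answer = policy(question, transcript)
        if answer == reference:
            transcript.append((answer, "correct"))
            return True, attempt, transcript
        remaining = max_attempts - attempt
        transcript.append((answer, f"incorrect, {remaining} attempt(s) left"))
    return False, max_attempts, transcript


def toy_policy(question, transcript):
    """Hypothetical stand-in for an LLM: wrong at first, corrected after feedback."""
    return "5" if not transcript else "4"
```

Evaluating the same policy with `max_attempts=1` and `max_attempts=2` mirrors the comparison reported in the abstract, where accuracy is measured with one and with two attempts.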

Title: PanguIR Technical Report for NTCIR-18 AEOLLM Task

Authors: Lang Mei, Chong Chen, Jiaxin Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04809
Pdf URL: https://arxiv.org/pdf/2503.04809
Copy Paste: [[2503.04809]] PanguIR Technical Report for NTCIR-18 AEOLLM Task(https://arxiv.org/abs/2503.04809)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) gain widespread attention in both academia and industry, it becomes increasingly critical and challenging to effectively evaluate their capabilities. Existing evaluation methods can be broadly categorized into two types: manual evaluation and automatic evaluation. Manual evaluation, while comprehensive, is often costly and resource-intensive. Conversely, automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria (dominated by reference-based answers). To address these challenges, NTCIR-18 introduced the AEOLLM (Automatic Evaluation of LLMs) task, aiming to encourage reference-free evaluation methods that can overcome the limitations of existing approaches. In this paper, to enhance the evaluation performance of the AEOLLM task, we propose three key methods to improve the reference-free evaluation: 1) Multi-model Collaboration: Leveraging multiple LLMs to approximate human ratings across various subtasks; 2) Prompt Auto-optimization: Utilizing LLMs to iteratively refine the initial task prompts based on evaluation feedback from training samples; and 3) In-context Learning (ICL) Optimization: Based on the multi-task evaluation feedback, we train a specialized in-context example retrieval model, combined with a semantic relevance retrieval model, to jointly identify the most effective in-context learning examples. Experiments conducted on the final dataset demonstrate that our approach achieves superior performance on the AEOLLM task.

Title: Multi-Agent System for AI-Assisted Extraction of Narrative Arcs in TV Series

Authors: Roberto Balestri, Guglielmo Pescatore
Subjects: cs.CL, cs.AI, cs.MA, cs.MM
Abstract URL: https://arxiv.org/abs/2503.04817
Pdf URL: https://arxiv.org/pdf/2503.04817
Copy Paste: [[2503.04817]] Multi-Agent System for AI-Assisted Extraction of Narrative Arcs in TV Series(https://arxiv.org/abs/2503.04817)
Keywords: agent
Abstract: Serialized TV shows are built on complex storylines that can be hard to track and evolve in ways that defy straightforward analysis. This paper introduces a multi-agent system designed to extract and analyze these narrative arcs. Tested on the first season of Grey's Anatomy (ABC 2005-), the system identifies three types of arcs: Anthology (self-contained), Soap (relationship-focused), and Genre-Specific (strictly related to the series' genre). Episodic progressions of these arcs are stored in both relational and semantic (vectorial) databases, enabling structured analysis and comparison. To bridge the gap between automation and critical interpretation, the system is paired with a graphical interface that allows for human refinement using tools to enhance and visualize the data. The system performed strongly in identifying Anthology Arcs and character entities, but its reliance on textual paratexts (such as episode summaries) revealed limitations in recognizing overlapping arcs and subtler dynamics. This approach highlights the potential of combining computational and human expertise in narrative analysis. Beyond television, it offers promise for serialized written formats, where the narrative resides entirely in the text. Future work will explore the integration of multimodal inputs, such as dialogue and visuals, and expand testing across a wider range of genres to refine the system further.

Title: Prompting Science Report 1: Prompt Engineering is Complicated and Contingent

Authors: Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04818
Pdf URL: https://arxiv.org/pdf/2503.04818
Copy Paste: [[2503.04818]] Prompting Science Report 1: Prompt Engineering is Complicated and Contingent(https://arxiv.org/abs/2503.04818)
Keywords: language model, llm, prompt
Abstract: This is the first of a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we demonstrate two things. First, there is no single standard for measuring whether a Large Language Model (LLM) passes a benchmark, and choosing a standard has a big impact on how well the LLM does on that benchmark. The standard you choose will depend on your goals for using an LLM in a particular case. Second, it is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question. Specifically, we find that sometimes being polite to the LLM helps performance, and sometimes it lowers performance. We also find that constraining the AI's answers helps performance in some cases, though it may lower performance in other cases. Taken together, this suggests that benchmarking AI performance is not one-size-fits-all, and also that particular prompting formulas or approaches, like being polite to the AI, are not universally valuable.
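To make the point about pass standards concrete, the sketch below scores the same repeated-trial results under several thresholds. The threshold names and values, the function name `pass_rates`, and the sample data are illustrative assumptions, not figures taken from the abstract.

```python
def pass_rates(trial_results, thresholds):
    """Fraction of questions whose per-question accuracy meets each threshold.

    trial_results maps a question id to a list of booleans, one per repeated trial.
    thresholds maps a standard's name to the minimum fraction of correct trials.
    """
    rates = {}
    for name, minimum in thresholds.items():
        passed = sum(
            1
            for outcomes in trial_results.values()
            if sum(outcomes) / len(outcomes) >= minimum
        )
        rates[name] = passed / len(trial_results)
    return rates


# Hypothetical results: four questions, ten trials each.
results = {
    "q1": [True] * 10,
    "q2": [True] * 9 + [False],
    "q3": [True] * 6 + [False] * 4,
    "q4": [True] * 2 + [False] * 8,
}
standards = {"always_correct": 1.0, "mostly_correct": 0.9, "majority_correct": 0.51}
rates = pass_rates(results, standards)
```

With identical model outputs, the reported score moves from a quarter of the questions to three quarters depending only on which standard is applied.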

Title: HeTGB: A Comprehensive Benchmark for Heterophilic Text-Attributed Graphs

Authors: Shujie Li, Yuxia Wu, Chuan Shi, Yuan Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04822
Pdf URL: https://arxiv.org/pdf/2503.04822
Copy Paste: [[2503.04822]] HeTGB: A Comprehensive Benchmark for Heterophilic Text-Attributed Graphs(https://arxiv.org/abs/2503.04822)
Keywords: language model
Abstract: Graph neural networks (GNNs) have demonstrated success in modeling relational data primarily under the assumption of homophily. However, many real-world graphs exhibit heterophily, where linked nodes belong to different categories or possess diverse attributes. Additionally, nodes in many domains are associated with textual descriptions, forming heterophilic text-attributed graphs (TAGs). Despite their significance, the study of heterophilic TAGs remains underexplored due to the lack of comprehensive benchmarks. To address this gap, we introduce the Heterophilic Text-attributed Graph Benchmark (HeTGB), a novel benchmark comprising five real-world heterophilic graph datasets from diverse domains, with nodes enriched by extensive textual descriptions. HeTGB enables systematic evaluation of GNNs, pre-trained language models (PLMs) and co-training methods on the node classification task. Through extensive benchmarking experiments, we showcase the utility of text attributes in heterophilic graphs, analyze the challenges posed by heterophilic TAGs and the limitations of existing models, and provide insights into the interplay between graph structures and textual attributes. We have publicly released HeTGB with baseline implementations to facilitate further research in this field.

Title: Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems

Authors: Mahfuz Ahmed Anik, Abdur Rahman, Azmine Toushik Wasi, Md Manjurul Ahsan
Subjects: cs.CL, cs.AI, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2503.04827
Pdf URL: https://arxiv.org/pdf/2503.04827
Copy Paste: [[2503.04827]] Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems(https://arxiv.org/abs/2503.04827)
Keywords: language model, gpt, agent
Abstract: Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance, leading to translations that marginalize linguistic diversity. To address these challenges, we propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities. Our approach leverages specialized agents for translation, interpretation, content synthesis, and bias evaluation, ensuring that linguistic accuracy and cultural relevance are preserved. Using CrewAI and LangChain, our system enhances contextual fidelity while mitigating biases through external validation. Comparative analysis shows that our framework outperforms GPT-4o, producing contextually rich and culturally embedded translations, a critical advancement for Indigenous, regional, and low-resource languages. This research underscores the potential of multi-agent AI in fostering equitable, sustainable, and culturally sensitive NLP technologies, aligning with the AI Governance, Cultural NLP, and Sustainable NLP pillars of Language Models for Underserved Communities. Our full experimental codebase is publicly available at: this https URL

Title: Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications

Authors: Vishakha Agrawal, Archie Chaudhury, Shreya Agrawal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04828
Pdf URL: https://arxiv.org/pdf/2503.04828
Copy Paste: [[2503.04828]] Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications(https://arxiv.org/abs/2503.04828)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) are fundamentally next-token prediction systems, their practical applications extend far beyond this basic function. From natural language processing and text generation to conversational assistants and software use, LLMs have numerous use-cases, and have already acquired a significant degree of enterprise adoption. To evaluate such models, static evaluation datasets, consisting of a set of prompts and their corresponding ground truths, are often used to benchmark the efficacy of the model for a particular task. In this paper, we provide the basis for a more comprehensive evaluation framework, based upon a traditional game and tool-based architecture that enables a more overarching measurement of a model's capabilities. For simplicity, we provide a generalized foundation that can be extended, without significant alteration, to numerous scenarios, from specific use cases such as supply chain management or financial reasoning, to abstract measurements such as ethics or safety.

Title: Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents

Authors: Jingying Zeng, Hui Liu, Zhenwei Dai, Xianfeng Tang, Chen Luo, Samarth Varshney, Zhen Li, Qi He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04830
Pdf URL: https://arxiv.org/pdf/2503.04830
Copy Paste: [[2503.04830]] Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents(https://arxiv.org/abs/2503.04830)
Keywords: language model, llm, agent
Abstract: With the advancement of conversational large language models (LLMs), several LLM-based Conversational Shopping Agents (CSA) have been developed to help customers answer questions and smooth their shopping journey in the e-commerce domain. The primary objective in building a trustworthy CSA is to ensure the agent's responses are accurate and factually grounded, which is essential for building customer trust and encouraging continuous engagement. However, two challenges remain. First, LLMs produce hallucinated or unsupported claims. Such inaccuracies risk spreading misinformation and diminishing customer trust. Second, without knowledge source attribution in CSA responses, customers struggle to verify LLM-generated information. To address these challenges, we present an easily productionized solution that enables a "citation experience", utilizing In-context Learning (ICL) and Multi-UX-Inference (MUI) to generate responses with citations that attribute their original sources without interfering with other existing UX features. With proper UX design, these citation marks can be linked to the related product information and display the source to our customers. In this work, we also build auto-metrics and scalable benchmarks to holistically evaluate LLMs' grounding and attribution capabilities. Our experiments demonstrate that incorporating this citation generation paradigm can substantially enhance the grounding of LLM responses by 13.83% on real-world data. As such, our solution not only addresses the immediate challenges of LLM grounding issues but also adds transparency to conversational AI.

Title: "Only ChatGPT gets me": An Empirical Analysis of GPT versus other Large Language Models for Emotion Detection in Text

Authors: Florian Lecourt (LIRMM | ADVANSE), Madalina Croitoru (GRAPHIK), Konstantin Todorov (LIRMM | WEB3)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04831
Pdf URL: https://arxiv.org/pdf/2503.04831
Copy Paste: [[2503.04831]] "Only ChatGPT gets me": An Empirical Analysis of GPT versus other Large Language Models for Emotion Detection in Text(https://arxiv.org/abs/2503.04831)
Keywords: language model, gpt, llm, chat
Abstract: This work investigates the capabilities of large language models (LLMs) in detecting and understanding human emotions through text. Drawing upon emotion models from psychology, we adopt an interdisciplinary perspective that integrates insights from the computational and affective sciences. The main goal is to assess how accurately LLMs can identify emotions expressed in textual interactions and to compare different models on this specific task. This research contributes to broader efforts to enhance human-computer interaction, making artificial intelligence technologies more responsive and sensitive to users' emotional nuances. By employing a methodology that involves comparisons with a state-of-the-art model on the GoEmotions dataset, we aim to gauge LLMs' effectiveness as a system for emotional analysis, paving the way for potential applications in various fields that require a nuanced understanding of human language.

Title: Extrapolation Merging: Keep Improving With Extrapolation and Merging

Authors: Yiguan Lin, Bin Xu, Yinghao Li, Yang Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04834
Pdf URL: https://arxiv.org/pdf/2503.04834
Copy Paste: [[2503.04834]] Extrapolation Merging: Keep Improving With Extrapolation and Merging(https://arxiv.org/abs/2503.04834)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) require instruction fine-tuning to perform different downstream tasks. However, the instruction fine-tuning phase still demands significant computational resources and labeled data, and there is no paradigm that can improve model performance without additional computational power and data. Model merging aims to enhance performance by combining the parameters of different models, but because the merging process lacks a clear optimization direction, improved performance is not guaranteed. In this paper, we attempt to provide a clear optimization direction for model merging. We first validate the effectiveness of the model extrapolation method during the instruction fine-tuning phase. Then, we propose Extrapolation Merging, a paradigm that can continue improving model performance without requiring extra computational resources or data. Using the extrapolation method, we provide a clear direction for model merging, achieving local optimization search, and consequently enhancing the merged model's performance. We conduct experiments on seven different tasks, and the results show that our method can consistently improve the model's performance after fine-tuning.
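The abstract does not give formulas, so the sketch below only illustrates one common way to write the two ingredients it names: linear extrapolation of parameters past a fine-tuned checkpoint, followed by weighted averaging of several parameter sets. The update rule, the function names, the scalar `alpha`, and the dictionary-of-floats representation of a model are all assumptions made for this example and should not be read as the paper's exact procedure.

```python
def extrapolate(base, tuned, alpha):
    """Move past the tuned parameters along the direction (tuned - base)."""
    return {name: tuned[name] + alpha * (tuned[name] - base[name]) for name in tuned}


def merge(models, weights=None):
    """Weighted average of parameter dictionaries (uniform weights by default)."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return {
        name: sum(w * model[name] for w, model in zip(weights, models))
        for name in models[0]
    }


# Hypothetical one-parameter "models" used only to show the arithmetic.
base = {"w": 1.0}
tuned = {"w": 2.0}
extrapolated = extrapolate(base, tuned, alpha=0.5)
merged = merge([extrapolated, {"w": 1.5}])
```

Neither step involves gradient computation or new training data, which matches the abstract's claim that the approach needs no extra computational resources or data.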

Title: Framing the Game: How Context Shapes LLM Decision-Making

Authors: Isaac Robinson, John Burden
Subjects: cs.CL, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2503.04840
Pdf URL: https://arxiv.org/pdf/2503.04840
Copy Paste: [[2503.04840]] Framing the Game: How Context Shapes LLM Decision-Making(https://arxiv.org/abs/2503.04840)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed across diverse contexts to support decision-making. While existing evaluations effectively probe latent model capabilities, they often overlook the impact of context framing on perceived rational decision-making. In this study, we introduce a novel evaluation framework that systematically varies evaluation instances across key features and procedurally generates vignettes to create highly varied scenarios. By analyzing decision-making patterns across different contexts with the same underlying game structure, we uncover significant contextual variability in LLM responses. Our findings demonstrate that this variability is largely predictable yet highly sensitive to framing effects. Our results underscore the need for dynamic, context-aware evaluation methodologies for real-world deployments.

Title: Three tiers of computation in transformers and in brain architectures

Authors: E Graham, R Granger
Subjects: cs.CL, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.04848
Pdf URL: https://arxiv.org/pdf/2503.04848
Copy Paste: [[2503.04848]] Three tiers of computation in transformers and in brain architectures(https://arxiv.org/abs/2503.04848)
Keywords: language model
Abstract: Specific empirical phenomena spanning human natural language, and mathematical and logical abilities, are rigorously situated in the well-studied grammar-automata (G-A) hierarchy. We identify three tiers and corresponding two transitions within the hierarchy and show their correspondence to the emergence of particular abilities in humans and in transformer-based language models (LMs). These emergent abilities have often been described in terms of "scaling"; we show that it is the transition between tiers, rather than size itself, that determines a system's capabilities. Specifically, humans effortlessly process language yet require specific training to perform arithmetic or logical reasoning tasks; and LMs possess language abilities absent from predecessor systems yet still struggle with logical processing. The resulting principled analyses provide underlying explanatory accounts of both the abilities and shortfalls of these systems, and suggest actionable insights into the expansion of logic abilities in AI systems.
摘要：跨越人类自然语言以及数学和逻辑能力的特定经验现象在精心研究的语法 - automata（G-A）层次结构中严格。我们在层次结构中确定了三个层和对应的两个过渡，并将其与人类和基于变压器的语言模型（LMS）中特定能力的出现表示对应。这些紧急的能力通常是用“缩放”来描述的。我们证明，决定系统功能的是层次的过渡，而不是大小本身之间的过渡。具体而言，人类毫不费力地处理语言，但需要特定的培训来执行算术或逻辑推理任务； LM具有前代系统中缺乏语言能力，但仍在逻辑处理方面挣扎。由此产生的原则分析为这些系统的能力和缺点提供了基本解释，并提出了可行的见解，以了解AI系统中逻辑能力的扩展。

Title: Enhancing Collective Intelligence in Large Language Models Through Emotional Integration

Authors: Likith Kadiyala, Ramteja Sajja, Yusuf Sermet, Ibrahim Demir
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2503.04849
Pdf URL: https://arxiv.org/pdf/2503.04849
Copy Paste: [[2503.04849]] Enhancing Collective Intelligence in Large Language Models Through Emotional Integration(https://arxiv.org/abs/2503.04849)
Keywords: language model, llm
Abstract: This research investigates the integration of emotional diversity into Large Language Models (LLMs) to enhance collective intelligence. Inspired by the human wisdom of crowds phenomenon, where group decisions often outperform individual judgments, we fine-tuned the DarkIdol-Llama-3.1-8B model using Google's GoEmotions dataset and Low-Rank Adaptation (LoRA) to simulate emotionally diverse responses. Evaluating the model on a distance estimation task between Fargo, ND, and Seattle, WA, across 15,064 unique persona configurations, we analyzed how emotional states and social attributes influence decision-making. Our findings demonstrate that emotional integration shapes response patterns while maintaining acceptable prediction accuracy, revealing its potential to enhance artificial collective intelligence. This study provides valuable insights into the interplay of emotional diversity and decision-making in LLMs, suggesting pathways for creating emotionally aware AI systems that balance emotional depth with analytical precision.
摘要：这项研究调查了情绪多样性与大语言模型（LLMS）的整合，以增强集体智慧。受人群现象的智慧的启发，我们使用Google的GoEmotions数据集和低秩适应（LORA）来微调Darkidol-Lalla-3.1-8b模型，以模拟情感多样化的反应。在15,064种独特的角色配置中，评估ND，ND和西雅图的距离估计任务的模型，我们分析了情感状态和社会属性如何影响决策。我们的发现表明，情感整合塑造了响应模式，同时保持了可接受的预测准确性，从而揭示了其增强人造集体智力的潜力。这项研究为LLM中的情绪多样性和决策的相互作用提供了宝贵的见解，这表明了创建情感意识的AI系统的途径，这些AI系统平衡了情绪深度与分析精度。

Title: One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs

Authors: Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04856
Pdf URL: https://arxiv.org/pdf/2503.04856
Copy Paste: [[2503.04856]] One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs(https://arxiv.org/abs/2503.04856)
Keywords: language model, gpt, llm, prompt
Abstract: Despite extensive safety enhancements in large language models (LLMs), multi-turn "jailbreak" conversations crafted by skilled human adversaries can still breach even the most sophisticated guardrails. However, these multi-turn attacks demand considerable manual effort, limiting their scalability. In this work, we introduce a novel approach called Multi-turn-to-Single-turn (M2S) that systematically converts multi-turn jailbreak prompts into single-turn attacks. Specifically, we propose three conversion strategies - Hyphenize, Numberize, and Pythonize - each preserving sequential context yet packaging it in a single query. Our experiments on the Multi-turn Human Jailbreak (MHJ) dataset show that M2S often increases or maintains high Attack Success Rates (ASRs) compared to original multi-turn conversations. Notably, using a StrongREJECT-based evaluation of harmfulness, M2S achieves up to 95.9% ASR on Mistral-7B and outperforms original multi-turn prompts by as much as 17.5% in absolute improvement on GPT-4o. Further analysis reveals that certain adversarial tactics, when consolidated into a single prompt, exploit structural formatting cues to evade standard policy checks. These findings underscore that single-turn attacks - despite being simpler and cheaper to conduct - can be just as potent as, if not more potent than, their multi-turn counterparts. Our findings underscore the urgent need to reevaluate and reinforce LLM safety strategies, given how adversarial queries can be compacted into a single prompt while still retaining sufficient complexity to bypass existing safety measures.
摘要：尽管大型语言模型（LLM）的安全性得到广泛的提高，但由熟练的人类对手制作的多转弯“越狱”对话即使是最复杂的护栏仍然可能违反。但是，这些多转弯攻击需要大量的手动努力，从而限制了它们的可扩展性。在这项工作中，我们介绍了一种名为多转弯转弯（M2S）的新颖方法，该方法系统地将多转弯的越狱提示转换为单转攻击。具体而言，我们提出了三种转换策略 - 连字符，编号和pythonize - 每个保留顺序上下文，但在单个查询中包装。我们对多转弯人类越狱（MHJ）数据集的实验表明，与原始的多转交谈相比，M2通常会增加或保持高攻击成功率（ASRS）。值得注意的是，使用基于强烈的有害性评估，M2在Mistral-7b上实现高达95.9％的ASR，并且在GPT-4O的绝对改进方面，Mistral-7b上的原始多转变提示均高达17.5％。进一步的分析表明，将某些对抗性策略合并为单个提示，利用结构格式提示以逃避标准政策检查。这些发现强调了单转攻击（尽管进行更简单，更便宜）可能比其多扭转的攻击更有效，甚至更多。我们的发现强调了迫切需要重新评估和加强LLM安全策略，因为如何将对抗性查询压缩到一个提示中，同时仍然保留足够的复杂性以绕过现有的安全措施。

Title: Codebook Reduction and Saturation: Novel observations on Inductive Thematic Saturation for Large Language Models and initial coding in Thematic Analysis

Authors: Stefano De Paoli, Walter Stan Mathis
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.04859
Pdf URL: https://arxiv.org/pdf/2503.04859
Copy Paste: [[2503.04859]] Codebook Reduction and Saturation: Novel observations on Inductive Thematic Saturation for Large Language Models and initial coding in Thematic Analysis(https://arxiv.org/abs/2503.04859)
Keywords: language model, llm
Abstract: This paper reflects on the process of performing Thematic Analysis with Large Language Models (LLMs). Specifically, the paper deals with the problem of analytical saturation of initial codes, as produced by LLMs. Thematic Analysis is a well-established qualitative analysis method composed of interlinked phases. A key phase is the initial coding, where the analysts assign labels to discrete components of a dataset. Saturation is a way to measure the validity of a qualitative analysis and relates to the recurrence and repetition of initial codes. In the paper, we reflect on how well LLMs achieve analytical saturation and also propose a novel technique to measure Inductive Thematic Saturation (ITS). This novel technique leverages a programming framework called DSPy. The proposed novel approach allows a precise measurement of ITS.
摘要：本文反映了使用大语言模型（LLM）进行主题分析的过程。具体而言，本文处理了LLMS产生的初始代码的分析饱和问题。主题分析是由相互链接阶段组成的完善的定性分析方法。关键阶段是初始编码，分析师将标签分配给数据集的离散组件。饱和度是一种测量定性分析的有效性并与初始代码的复发和重复有关的方法。在本文中，我们反映了LLM的实现分析饱和度的效果，并提出了一种测量电感主题饱和（ITS）的新技术。这种新颖的技术利用了一个名为DSPY的编程框架。提出的新方法允许对其进行精确测量。

Title: TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

Authors: Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Change Jia, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Bin Cui, Tong Yang, Xiangzheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04872
Pdf URL: https://arxiv.org/pdf/2503.04872
Copy Paste: [[2503.04872]] TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation(https://arxiv.org/abs/2503.04872)
Keywords: language model, llm
Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is selectively distilled into specialized student models via domain-specific supervised fine-tuning (SFT); and (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
摘要：在保持性能的同时缩小大语言模型（LLM）规模的挑战引起了广泛关注。但是，现有的方法（例如模型蒸馏和迁移学习）通常无法实现高精度。为了解决这一限制，我们提出了分支-合并（Branch-Merge）蒸馏方法，该方法通过两个阶段增强模型压缩：（1）分支阶段，通过特定领域的监督微调（SFT），将大型教师模型的知识有选择地蒸馏到专门的学生模型中；（2）合并阶段，将这些学生模型合并，以实现跨领域知识迁移并改善泛化。我们使用DeepSeek-R1作为教师、DeepSeek-R1-Distill-Qwen-32B作为学生来验证我们的蒸馏方法。合并得到的模型TinyR1-32B-Preview在多个基准上优于对应的DeepSeek-R1-Distill-Qwen-32B，包括数学（+5.5分）、编程（+4.4分）和科学（+2.9分），同时在AIME 2024上取得了与DeepSeek-R1几乎相当的性能。分支-合并蒸馏方法为创建更小、高性能的LLM提供了可扩展的解决方案，并降低了计算成本和时间。

Title: Are Large Language Models Good In-context Learners for Financial Sentiment Analysis?

Authors: Xinyu Wei, Luojia Liu
Subjects: cs.CL, cs.AI, q-fin.CP
Abstract URL: https://arxiv.org/abs/2503.04873
Pdf URL: https://arxiv.org/pdf/2503.04873
Copy Paste: [[2503.04873]] Are Large Language Models Good In-context Learners for Financial Sentiment Analysis?(https://arxiv.org/abs/2503.04873)
Keywords: language model, llm
Abstract: Recently, large language models (LLMs) with hundreds of billions of parameters have demonstrated emergent abilities, surpassing traditional methods in various domains even without fine-tuning over domain-specific data. However, when it comes to financial sentiment analysis (FSA), a fundamental task in financial AI, these models often encounter various challenges, such as complex financial terminology, subjective human emotions, and ambiguous inclination expressions. In this paper, we aim to answer the fundamental question: are LLMs good in-context learners for FSA? Answering this question can yield informative insights on whether LLMs can learn to address the challenges by generalizing in-context demonstrations of financial document-sentiment pairs to the sentiment analysis of new documents, given that finetuning these models on finance-specific data is difficult, if not impossible. To the best of our knowledge, this is the first paper exploring in-context learning for FSA that covers most modern LLMs (recently released DeepSeek V3 included) and multiple in-context sample selection methods. Comprehensive experiments validate the in-context learning capability of LLMs for FSA.
摘要：最近，具有数千亿参数的大型语言模型（LLM）展现出了涌现能力，即使没有针对特定领域的数据进行微调，也在多个领域超越了传统方法。但是，在金融情感分析（FSA）这一金融AI的基本任务上，这些模型经常遇到各种挑战，例如复杂的金融术语、主观的人类情感和含糊的倾向表达。在本文中，我们旨在回答一个基本问题：LLM是否是FSA的良好上下文学习者？回答这个问题可以提供有益的见解，说明LLM能否通过将金融文档-情感对的上下文示例泛化到新文档的情感分析来应对这些挑战，因为在金融特定数据上微调这些模型即使并非完全不可能，也十分困难。据我们所知，这是第一篇探索FSA上下文学习的论文，涵盖了大多数现代LLM（包括最近发布的DeepSeek V3）和多种上下文样本选择方法。全面的实验验证了LLM在FSA上的上下文学习能力。

Title: Memory Is All You Need: Testing How Model Memory Affects LLM Performance in Annotation Tasks

Authors: Joan C. Timoneda, Sebastián Vallejo Vera
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04874
Pdf URL: https://arxiv.org/pdf/2503.04874
Copy Paste: [[2503.04874]] Memory Is All You Need: Testing How Model Memory Affects LLM Performance in Annotation Tasks(https://arxiv.org/abs/2503.04874)
Keywords: language model, gpt, llm
Abstract: Generative Large Language Models (LLMs) have shown promising results in text annotation using zero-shot and few-shot learning. Yet these approaches do not allow the model to retain information from previous annotations, making each response independent from the preceding ones. This raises the question of whether model memory -- the LLM having knowledge about its own previous annotations in the same task -- affects performance. In this article, using OpenAI's GPT-4o and Meta's Llama 3.1 on two political science datasets, we demonstrate that allowing the model to retain information about its own previous classifications yields significant performance improvements: between 5 and 25% when compared to zero-shot and few-shot learning. Moreover, memory reinforcement, a novel approach we propose that combines model memory and reinforcement learning, yields additional performance gains in three out of our four tests. These findings have important implications for applied researchers looking to improve performance and efficiency in LLM annotation tasks.
摘要：生成的大语言模型（LLMS）在使用零拍和几乎没有学习的文本注释中显示出令人鼓舞的结果。然而，这些方法不允许模型保留以前的注释中的信息，从而使每个响应都独立于前面的响应。这就提出了一个问题，即模型内存 - LLM是否对同一任务中的先前注释有所了解 - 会影响性能。在本文中，使用OpenAI的GPT-4O和Meta的Llama 3.1在两个政治学数据集上，我们证明，允许该模型保留有关其先前分类的信息可实现重大的性能改进：与零射击相比，在5％至25％之间，与零射击和几乎没有射击学习相比。此外，我们提出的一种新颖的方法是结合模型记忆和增强学习的一种新颖方法，在我们的四个测试中的三个测试中可以增加性能提高。这些发现对希望提高LLM注释任务的性能和效率的应用研究人员具有重要意义。

Title: Architecture for a Trustworthy Quantum Chatbot

Authors: Yaiza Aragonés-Soria, Manuel Oriol
Subjects: cs.CL, quant-ph
Abstract URL: https://arxiv.org/abs/2503.04875
Pdf URL: https://arxiv.org/pdf/2503.04875
Copy Paste: [[2503.04875]] Architecture for a Trustworthy Quantum Chatbot(https://arxiv.org/abs/2503.04875)
Keywords: language model, gpt, llm, chat
Abstract: Large language model (LLM)-based tools such as ChatGPT seem useful for classical programming assignments. The more specialized the field, the more likely they lack reliability because of the lack of data to train them. In the case of quantum computing, the quality of answers of generic chatbots is low. C4Q is a chatbot focused on quantum programs that addresses this challenge through a software architecture that integrates specialized LLMs to classify requests and specialized question answering modules with a deterministic logical engine to provide trustworthy quantum computing support. This article describes the latest version (2.0) of C4Q, which delivers several enhancements: ready-to-run Qiskit code for gate definitions and circuit operations, expanded features to solve software engineering tasks such as the travelling salesperson problem and the knapsack problem, and a feedback mechanism for iterative improvement. Extensive testing of the backend confirms the system's reliability, while empirical evaluations show that C4Q 2.0's classification LLM reaches near-perfect accuracy. The evaluation of the result consists of a comparative study with three existing chatbots highlighting C4Q 2.0's maintainability and correctness, reflecting on how software architecture decisions, such as separating deterministic logic from probabilistic text generation, impact the quality of the results.
摘要：大型语言模型（LLM）等工具（例如ChatGpt）似乎对经典编程作业有用。该领域越专业，由于缺乏培训数据，它们缺乏可靠性。在量子计算的情况下，通用聊天机器人的答案质量很低。 C4Q是一个聊天机器人，专注于量子程序，该程序通过软件体系结构来应对这一挑战，该软件体系结构集成了专业的LLMS，以将请求和专业的问题与确定性逻辑引擎与确定性逻辑引擎进行分类，以提供可信赖的量子计算支持。本文介绍了C4Q的最新版本（2.0），该版本（2.0）提供了多种增强功能：用于门定义和电路操作的现成运行的Qiskit代码，扩展的功能可以解决软件工程任务，例如旅行销售人员问题和背包问题，以及一种反馈机制，以进行迭代改进。对后端的广泛测试证实了系统的可靠性，而经验评估表明，C4Q 2.0的分类LLM达到了几乎完美的精度。对结果的评估包括一项比较研究，其中三个现有的聊天机器人突出了C4Q 2.0的可维护性和正确性，反映了软件体系结构的决策（例如将确定性逻辑与概率文本生成分开）如何影响结果的质量。

Title: Maximizing Signal in Human-Model Preference Alignment

Authors: Kelsey Kraus, Margaret Kroll
Subjects: cs.CL, stat.ME
Abstract URL: https://arxiv.org/abs/2503.04910
Pdf URL: https://arxiv.org/pdf/2503.04910
Copy Paste: [[2503.04910]] Maximizing Signal in Human-Model Preference Alignment(https://arxiv.org/abs/2503.04910)
Keywords: llm
Abstract: The emergence of powerful LLMs has led to a paradigm shift in Natural Language Understanding and Natural Language Generation. The properties that make LLMs so valuable for these tasks -- creativity, ability to produce fluent speech, and ability to quickly and effectively abstract information from large corpora -- also present new challenges to evaluating their outputs. The rush to market has led teams to fall back on quick, cost-effective automatic evaluations which offer value, but do not obviate the need for human judgments in model training and evaluation. This paper argues that in cases in which end users need to agree with the decisions made by ML models -- e.g. in toxicity detection or extraction of main points for summarization -- models should be trained and evaluated on data that represent the preferences of those users. We support this argument by explicating the role of human feedback in labeling and judgment tasks for model training and evaluation. First, we propose methods for disentangling noise from signal in labeling tasks. Then we show that noise in labeling disagreement can be minimized by adhering to proven methodological best practices, while signal can be maximized to play an integral role in model training and evaluation tasks. Finally, we illustrate best practices by providing a case study in which two guardrails classifiers are evaluated using human judgments to align final model behavior to user preferences. We aim for this paper to provide researchers and professionals with guidelines for integrating human judgments into their ML and generative AI evaluation toolkit, particularly when working toward achieving accurate and unbiased features that align with users' needs and expectations.
摘要：强大的LLM的出现导致自然语言理解和自然语言产生的范式转变。使LLM对这些任务如此有价值的属性 - 创造力，产生流利的语音能力以及能够快速有效地从大型语料库中抽象信息的能力 - 也提出了评估其产出的新挑战。冲向市场已导致团队恢复了具有价值的快速，具有成本效益的自动评估，但并没有消除对模型培训和评估中人类判断的需求。本文认为，在最终用户需要同意ML模型的决定的情况下，例如在摘要的毒性检测或提取要点的毒性检测或提取 - 应根据代表这些用户偏好的数据进行培训和评估模型。我们通过阐明人类反馈在标签和判断任务中的作用来支持这一论点，以进行模型培训和评估。首先，我们提出了在标签任务中将噪声从信号中解散的方法。然后，我们表明，通过遵守证明的方法学最佳实践，可以将标记分歧标记的噪音最小化，而信号可以最大化以在模型培训和评估任务中发挥不可或缺的作用。最后，我们通过提供一个案例研究来说明最佳实践，在该案例研究中，使用人类判断将两个护栏分类器与用户偏好相结合。我们的目的是让本文为研究人员和专业人员提供指导方针，以将人类判断纳入其ML和生成的AI评估工具包中，尤其是在努力实现与用户的需求和期望相符的准确和无偏见的功能时。

Title: HILGEN: Hierarchically-Informed Data Generation for Biomedical NER Using Knowledgebases and Large Language Models

Authors: Yao Ge, Yuting Guo, Sudeshna Das, Swati Rajwal, Selen Bozkurt, Abeed Sarker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04930
Pdf URL: https://arxiv.org/pdf/2503.04930
Copy Paste: [[2503.04930]] HILGEN: Hierarchically-Informed Data Generation for Biomedical NER Using Knowledgebases and Large Language Models(https://arxiv.org/abs/2503.04930)
Keywords: language model, gpt, llm, prompt
Abstract: We present HILGEN, a Hierarchically-Informed Data Generation approach that combines domain knowledge from the Unified Medical Language System (UMLS) with synthetic data generated by large language models (LLMs), specifically GPT-3.5. Our approach leverages UMLS's hierarchical structure to expand training data with related concepts, while incorporating contextual information from LLMs through targeted prompts aimed at automatically generating synthetic examples for sparsely occurring named entities. The performance of the HILGEN approach was evaluated across four biomedical NER datasets (MIMIC III, BC5CDR, NCBI-Disease, and Med-Mentions) using BERT-Large and DANN (Data Augmentation with Nearest Neighbor Classifier) models, applying various data generation strategies, including UMLS, GPT-3.5, and their best ensemble. For the BERT-Large model, incorporating UMLS led to an average F1 score improvement of 40.36%, while using GPT-3.5 resulted in a comparable average increase of 40.52%. The Best-Ensemble approach using BERT-Large achieved the highest improvement, with an average increase of 42.29%. The DANN model's F1 score improved by 22.74% on average using the UMLS-only approach. The GPT-3.5-based method resulted in a 21.53% increase, and the Best-Ensemble DANN model showed a more notable improvement, with an average increase of 25.03%. Our proposed HILGEN approach improves NER performance in few-shot settings without requiring additional manually annotated data. Our experiments demonstrate that an effective strategy for optimizing biomedical NER is to combine biomedical knowledge curated in the past, such as the UMLS, and generative LLMs to create synthetic training instances. Our future research will focus on exploring additional innovative synthetic data generation strategies for further improving NER performance.
摘要：我们提出了Hilgen，这是一种层次了解的数据生成方法，它结合了统一医学语言系统（UMLS）的域知识与大语言模型（LLMS）生成的合成数据（特别是GPT-3.5）。我们的方法利用UMLS的层次结构通过相关概念扩展培训数据，同时通过旨在自动生成较少出现的命名实体的综合示例的目标提示来整合LLM的上下文信息。在四个生物医学NER数据集（MIMIC III，BC5CDR，NCBI-DISESE和MED-MENTIONS）中评估了Hilgen方法的性能，并使用Bert-large和Dann（数据增强与最近的邻居分类器的数据增强）模型，应用于包括UMLS，GPT-3.5和他们最佳的ENSEMERM，以及应用各种数据生成策略。对于BERT-LARGE模型，合并UMLS的平均F1得分提高了40.36％，而使用GPT-3.5则导致平均平均增加40.52％。使用Bert-large的最佳方法取得了最大的进步，平均增加了42.29％。 Dann Model的F1得分平均使用仅使用UMLS方法提高了22.74％。基于GPT-3.5的方法可提高21.53％，而最优惠的DANN模型显示出更明显的改善，平均增长25.03％。我们提出的Hilgen方法可以在几个弹药设置中提高NER性能，而无需其他手动注释的数据。我们的实验表明，优化生物医学NER的有效策略是结合过去策划的生物医学知识，例如UMLS和生成LLM，以创建合成训练实例。我们未来的研究将着重于探索其他创新的合成数据生成策略，以进一步提高NER性能。

Title: VQEL: Enabling Self-Developed Symbolic Language in Agents through Vector Quantization in Emergent Language Games

Authors: Mohammad Mahdi Samiei Paqaleh, Mahdieh Soleymani Baghshah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04940
Pdf URL: https://arxiv.org/pdf/2503.04940
Copy Paste: [[2503.04940]] VQEL: Enabling Self-Developed Symbolic Language in Agents through Vector Quantization in Emergent Language Games(https://arxiv.org/abs/2503.04940)
Keywords: agent
Abstract: In the field of emergent language, efforts have traditionally focused on developing communication protocols through interactions between agents in referential games. However, the aspect of internal language learning, where language serves not only as a communicative tool with others but also as a means for individual thinking, self-reflection, and problem-solving remains underexplored. Developing a language through self-play, without another agent's involvement, poses a unique challenge. It requires an agent to craft symbolic representations and train them using direct gradient methods. The challenge here is that if an agent attempts to learn symbolic representations through self-play using conventional modeling and techniques such as REINFORCE, the solution will offer no advantage over previous multi-agent approaches. We introduce VQEL, a novel method that incorporates Vector Quantization into the agents' architecture, enabling them to autonomously invent and develop discrete symbolic representations in a self-play referential game. Following the self-play phase, agents can enhance their language through reinforcement learning and interactions with other agents in the mutual-play phase. Our experiments across various datasets demonstrate that VQEL not only outperforms the traditional REINFORCE method but also benefits from improved control and reduced susceptibility to collapse, thanks to the incorporation of vector quantization.
摘要：在新兴语言领域，传统上的努力集中在通过参考游戏中的代理之间的互动来制定通信协议。但是，内部语言学习的各个方面不仅可以作为与他人的交流工具，而且可以作为个人思维，自我反思和解决问题的一种手段。通过自我玩法开发一种语言，而没有其他代理人的参与，提出了一个独特的挑战。它要求代理商来制作符号表示并使用直接梯度方法训练它们。这里的挑战是，如果代理商试图通过使用常规建模和诸如增强技术的技术来学习符号表示，则该解决方案将与以前的多代理方法相比，没有任何优势。我们介绍了VQEL，这是一种新颖的方法，将向量量化纳入代理的体系结构中，使它们能够自主发明和开发自我播放的参考游戏中的离散符号表示。遵循自我播放阶段，代理可以通过强化学习和在相互播放阶段与其他代理人的互动来增强其语言。我们在各种数据集中进行的实验表明，VQEL不仅胜过传统的增强方法，而且还从改善的控制和降低崩溃的易感性中受益，这要归功于矢量量化。
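
Illustration (not from the paper): the vector quantization step mentioned in the VQEL abstract can be pictured as a nearest-neighbour lookup that maps a continuous vector to a discrete symbol index. The sketch below is a minimal, assumption-laden example; the codebook values and the names `quantize`, `encode_message`, and `CODEBOOK` are hypothetical, and it omits the training side (for example, how gradients are passed through the discrete choice), which the abstract does not describe.

```python
import math

# Hypothetical 2-D codebook: each entry stands for one discrete "symbol".
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]


def quantize(vector, codebook=CODEBOOK):
    """Return (index, code vector) of the codebook entry nearest to `vector`."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(vector, codebook[i]),  # Euclidean distance
    )
    return best_index, codebook[best_index]


def encode_message(vectors, codebook=CODEBOOK):
    """Turn a sequence of continuous vectors into a sequence of symbol indices."""
    return [quantize(v, codebook)[0] for v in vectors]


if __name__ == "__main__":
    print(encode_message([(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]))
```

Each index plays the role of one symbol in the agent's message; a learned codebook would replace the fixed one used here.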

Title: Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems

Authors: Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos, Dongwon Lee
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.04945
Pdf URL: https://arxiv.org/pdf/2503.04945
Copy Paste: [[2503.04945]] Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems(https://arxiv.org/abs/2503.04945)
Keywords: chat
Abstract: The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.
摘要：生成模型的激增给区分真实的人类撰写内容与深度伪造内容带来了重大挑战。由AI工具增强的人类协作提供了一个有前景的解决方案。在这项研究中，我们探讨了促进审议的聊天机器人DeepFakeDeLiBot在支持群体检测深度伪造文本方面的潜力。我们的发现表明，与个人努力相比，基于群体的问题解决显著提高了识别机器生成段落的准确性。尽管使用DeepFakeDeLiBot总体上并未带来显著的性能提升，但它通过促进参与者的参与度、共识的建立以及基于推理的话语的频率和多样性来增强群体动态。此外，对小组协作有效性感知较高的参与者从DeepFakeDeLiBot中获得了性能收益。这些发现强调了审议型聊天机器人在促进互动且富有成效的群体动态、同时确保协作式深度伪造文本检测准确性方面的潜力。本研究中使用的数据集和源代码将在稿件被接受后公开。

Title: DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL

Authors: Haoyuan Ma, Yongliang Shen, Hengwei Liu, Wenqi Zhang, Haolei Xu, Qiuying Peng, Jun Wang, Weiming Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04959
Pdf URL: https://arxiv.org/pdf/2503.04959
Copy Paste: [[2503.04959]] DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL(https://arxiv.org/abs/2503.04959)
Keywords: language model, gpt, llm
Abstract: Recent text-to-SQL systems powered by large language models (LLMs) have demonstrated remarkable performance in translating natural language queries into SQL. However, these systems often struggle with complex database structures and domain-specific queries, as they primarily focus on enhancing logical reasoning and SQL syntax while overlooking the critical need for comprehensive database understanding. To address this limitation, we propose DB-Explore, a novel framework that systematically aligns LLMs with database knowledge through automated exploration and instruction synthesis. DB-Explore constructs database graphs to capture complex relational schemas, leverages GPT-4 to systematically mine structural patterns and semantic knowledge, and synthesizes instructions to distill this knowledge for efficient fine-tuning of LLMs. Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation, bridging the gap between database structures and language models. Experiments conducted on the SPIDER and BIRD benchmarks validate the effectiveness of DB-Explore, achieving an execution accuracy of 52.1% on BIRD and 84.0% on SPIDER. Notably, our open-source implementation, based on the Qwen2.5-coder-7B model, outperforms multiple GPT-4-driven text-to-SQL systems in comparative evaluations, and achieves near state-of-the-art performance with minimal computational cost.
摘要：最近由大语言模型（LLM）提供动力的文本到SQL系统在将自然语言查询转换为SQL方面表现出了显着的性能。但是，这些系统通常在复杂的数据库结构和特定于领域的查询中困难，因为它们主要着重于增强逻辑推理和SQL语法，同时忽略了对全面数据库理解的关键需求。为了解决这一限制，我们提出了DB-explore，这是一个新颖的框架，可以通过自动探索和指导合成系统地将LLM与数据库知识保持一致。 DB探索构造数据库图表以捕获复杂的关系模式，利用GPT-4系统地挖掘结构模式和语义知识，并合成指令以提炼此知识以有效地对LLM进行微调。我们的框架可以通过各种采样策略和自动指令生成来实现全面的数据库理解，从而弥合了数据库结构和语言模型之间的差距。在蜘蛛和鸟基测试基准上进行的实验验证了DB-explore的有效性，在鸟类上实现了52.1％的执行精度，而蜘蛛的执行精度为84.0％。值得注意的是，我们的开源实现基于QWEN2.5-编码7B模型，在比较评估中优于多个GPT-4驱动的文本到SQL系统，并以最低的计算成本实现了近乎最先进的性能。
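
Illustration (not from the paper): the execution accuracy reported in the DB-Explore abstract is, in general terms, the share of predicted SQL queries whose results match those of the reference queries when both are run against the database. The sketch below is a simplified version using Python's built-in `sqlite3`; the table, the queries, and the helper names (`execution_match`, `execution_accuracy`, `demo_connection`) are hypothetical, and the order-insensitive row comparison is an assumption that may differ from the official SPIDER and BIRD evaluation scripts.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same rows (order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


def demo_connection():
    """Build a tiny in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?, ?)",
        [(1, "Ana", 31), (2, "Bo", 24), (3, "Cy", 45)],
    )
    return conn


if __name__ == "__main__":
    pairs = [
        ("SELECT name FROM singer WHERE age > 30",
         "SELECT name FROM singer WHERE age >= 31"),
        ("SELECT name FROM singer",
         "SELECT name FROM singer WHERE age < 30"),
    ]
    print(execution_accuracy(demo_connection(), pairs))
```

Comparing results rather than query text means two differently written but equivalent queries both count as correct, which is the point of the metric.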

Title: Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

Authors: Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04973
Pdf URL: https://arxiv.org/pdf/2503.04973
Copy Paste: [[2503.04973]] Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning(https://arxiv.org/abs/2503.04973)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside top ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
摘要：将外部知识纳入大语言模型（LLMS）可以增强其在不同应用程序中的效用，但是现有的方法具有权衡。检索型发电（RAG）通过相似性搜索获得了证据，但关键信息可能不在排名最高的结果之外。长篇小说模型可以处理多个文档，但在计算上很昂贵，并且受上下文窗口大小的限制。受学生浓缩开放式考试的启发，我们提出了任务感知的键值（KV）缓存压缩，该键值（KV）缓存压缩将在零或几次设置中压缩外部知识。这使LLM可以有效地对所有相关信息的压实表示。实验表明，我们的方法的表现优于抹布和任务不合时宜的压缩方法。在Longbench V2上，它以30倍的压缩率提高了超过抹布的7个绝对点，同时将推断潜伏期从0.43降低到0.16。合成数据集突出显示，当稀疏证据足够时，抹布的性能很好，而任务感知的压缩对于广泛的知识任务是优越的。

Title: Application of integrated gradients explainability to sociopsychological semantic markers

Authors: Ali Aghababaei, Jan Nikadon, Magdalena Formanowicz, Maria Laura Bettinsoli, Carmen Cervone, Caterina Suitner, Tomaso Erseghe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04989
Pdf URL: https://arxiv.org/pdf/2503.04989
Copy Paste: [[2503.04989]] Application of integrated gradients explainability to sociopsychological semantic markers(https://arxiv.org/abs/2503.04989)
Keywords: agent
Abstract: Classification of textual data in terms of sentiment, or more nuanced sociopsychological markers (e.g., agency), is now a popular approach commonly applied at the sentence level. In this paper, we exploit the integrated gradient (IG) method to capture the classification output at the word level, revealing which words actually contribute to the classification process. This approach improves explainability and provides in-depth insights into the text. We focus on sociopsychological markers beyond sentiment and investigate how to effectively train IG in agency, one of the very few markers for which a verified deep learning classifier, BERTAgent, is currently available. Performance and system parameters are carefully tested, alternatives to the IG approach are evaluated, and the usefulness of the result is verified in a relevant application scenario. The method is also applied in a scenario where only a small labeled dataset is available, with the aim of exploiting IG to identify the salient words that contribute to building the different classes that relate to relevant sociopsychological markers. To achieve this, an uncommon training procedure that encourages overfitting is employed to enhance the distinctiveness of each class. The results are analyzed through the lens of social psychology, offering valuable insights.
摘要：现在，根据情感或更细微的社会心理学标记（例如代理）将文本数据分类已成为一种普遍在句子级别上应用的流行方法。在本文中，我们利用集成梯度（IG）方法在单词级别捕获分类输出，从而揭示了哪些单词实际上有助于分类过程。这种方法可提高解释性，并对文本提供深入的见解。我们关注超越情感的社会心理学标记，并研究如何有效地培训代理IG，这是当前可用的验证深度学习分类器Bertagent的极少数标记之一。仔细测试性能和系统参数，评估IG方法的替代方法，并在相关的应用程序方案中验证结果的有用性。该方法还适用于仅可用的小标签数据集的情况，目的是利用IG来识别有助于构建与相关社会心理学标记相关的不同类别的显着词。为了实现这一目标，采用鼓励过度拟合的罕见培训程序来增强每个班级的独特性。结果通过社会心理学的角度进行了分析，提供了宝贵的见解。
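
Illustration (not from the paper): integrated gradients attributes a model's output to each input feature by averaging the gradient along a straight path from a baseline to the input and scaling by the input-baseline difference. The sketch below is a minimal numerical version on a toy scoring function; the `model` weights, the zero baseline, and the finite-difference gradient are all assumptions made for illustration, whereas the paper applies IG to words through a BERT-based classifier.

```python
import math


def model(x):
    """Hypothetical toy scorer: a sigmoid over a fixed linear combination."""
    weights = [2.0, -1.0, 0.5]
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            totals[i] += g
    # Scale the average path gradient by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


if __name__ == "__main__":
    x, baseline = [1.0, 2.0, -1.0], [0.0, 0.0, 0.0]
    attributions = integrated_gradients(model, x, baseline)
    print(attributions, sum(attributions), model(x) - model(baseline))
```

A useful sanity check is the completeness property: the attributions should sum, up to approximation error, to the difference between the model output at the input and at the baseline.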

Title: DP-GTR: Differentially Private Prompt Protection via Group Text Rewriting

Authors: Mingchen Li, Heng Fan, Song Fu, Junhua Ding, Yunhe Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04990
Pdf URL: https://arxiv.org/pdf/2503.04990
Copy Paste: [[2503.04990]] DP-GTR: Differentially Private Prompt Protection via Group Text Rewriting(https://arxiv.org/abs/2503.04990)
Keywords: language model, llm, prompt
Abstract: Prompt privacy is crucial, especially when using online large language models (LLMs), due to the sensitive information often contained within prompts. While LLMs can enhance prompt privacy through text rewriting, existing methods primarily focus on document-level rewriting, neglecting the rich, multi-granular representations of text. This limitation restricts LLM utilization to specific tasks, overlooking their generalization and in-context learning capabilities, thus hindering practical application. To address this gap, we introduce DP-GTR, a novel three-stage framework that leverages local differential privacy (DP) and the composition theorem via group text rewriting. DP-GTR is the first framework to integrate both document-level and word-level information while exploiting in-context learning to simultaneously improve privacy and utility, effectively bridging local and global DP mechanisms at the individual data point level. Experiments on CommonSense QA and DocVQA demonstrate that DP-GTR outperforms existing approaches, achieving a superior privacy-utility trade-off. Furthermore, our framework is compatible with existing rewriting techniques, serving as a plug-in to enhance privacy protection. Our code is publicly available at this https URL for reproducibility.

Title: HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model

Authors: Xuheng Cai, Erica Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04996
Pdf URL: https://arxiv.org/pdf/2503.04996
Copy Paste: [[2503.04996]] HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model(https://arxiv.org/abs/2503.04996)
Keywords: language model
Abstract: Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but it is common that they are blurry or even missing due to erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which suffers from two major limitations: (i) They cannot handle severely damaged or completely missing hieroglyphs. (ii) They make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach to model hieroglyph recovery as a next word prediction task and use language models to address it. We compare the performance of different SOTA language models and choose LSTM as the architecture of our HieroLM due to the strong local affinity of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and scarce data, which makes it a pragmatic tool to assist scholars in inferring missing hieroglyphs. It can also complement CV-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at this https URL.

Title: Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models

Authors: Benyamin Jamialahmadi, Parsa Kavehzadeh, Mehdi Rezagholizadeh, Parsa Farinneya, Hossein Rajabzadeh, Aref Jafari, Boxing Chen, Marzieh Tahaei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05005
Pdf URL: https://arxiv.org/pdf/2503.05005
Copy Paste: [[2503.05005]] Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models(https://arxiv.org/abs/2503.05005)
Keywords: language model, llm
Abstract: Deploying large language models (LLMs) in real-world applications is often hindered by strict computational and latency constraints. While dynamic inference offers the flexibility to adjust model behavior based on varying resource budgets, existing methods are frequently limited by hardware inefficiencies or performance degradation. In this paper, we introduce Balcony, a simple yet highly effective framework for depth-based dynamic inference. By freezing the pretrained LLM and inserting additional transformer layers at selected exit points, Balcony maintains the full model's performance while enabling real-time adaptation to different computational budgets. These additional layers are trained using a straightforward self-distillation loss, aligning the sub-model outputs with those of the full model. This approach requires significantly fewer training tokens and tunable parameters, drastically reducing computational costs compared to prior methods. When applied to the LLaMA3-8B model, using only 0.2% of the original pretraining data, Balcony achieves minimal performance degradation while enabling significant speedups. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip as well as other leading compression techniques on multiple models and at various scales, across a variety of benchmarks.
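The abstract above says the added exit layers are trained with a self-distillation loss that aligns sub-model outputs with those of the full model, but it does not give the exact form. The sketch below assumes one common choice, the KL divergence between the frozen full model's next-token distribution (teacher) and the early-exit distribution (student). The function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed loss form)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    full_model_logits = [2.0, 0.5, -1.0]   # hypothetical frozen full-model output
    exit_logits = [1.0, 1.0, 0.0]          # hypothetical early-exit output
    print(distillation_loss(full_model_logits, exit_logits))
```

In a setup like the one described, only the inserted layers would receive gradients from this loss, since the pretrained model is frozen.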

Title: Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation

Authors: Bryan Li, Jiaming Luo, Eleftheria Briakou, Colin Cherry
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05010
Pdf URL: https://arxiv.org/pdf/2503.05010
Copy Paste: [[2503.05010]] Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation(https://arxiv.org/abs/2503.05010)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test-time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with a larger model's zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.

Title: Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety

Authors: Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, Ding Zhao
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2503.05021
Pdf URL: https://arxiv.org/pdf/2503.05021
Copy Paste: [[2503.05021]] Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety(https://arxiv.org/abs/2503.05021)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit weaknesses in traditional safety alignment, which often relies on rigid refusal heuristics or representation engineering to block harmful outputs. While these methods are effective against direct adversarial attacks, they fall short on broader safety challenges requiring nuanced, context-aware decision-making. To address this, we propose Reasoning-enhanced Finetuning for interpretable LLM Safety (Rational), a novel framework that trains models to engage in explicit safe reasoning before response. Fine-tuned models leverage the extensive pretraining knowledge in self-generated reasoning to bootstrap their own safety through structured reasoning, internalizing context-sensitive decision-making. Our findings suggest that safety extends beyond refusal, requiring context awareness for more robust, interpretable, and adaptive responses. Reasoning is not only a core capability of LLMs but also a fundamental mechanism for LLM safety. Rational employs reasoning-enhanced fine-tuning, allowing it to reject harmful prompts while providing meaningful and context-aware responses in complex scenarios.

Title: Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

Authors: Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.05037
Pdf URL: https://arxiv.org/pdf/2503.05037
Copy Paste: [[2503.05037]] Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence(https://arxiv.org/abs/2503.05037)
Keywords: llm, retrieval-augmented generation
Abstract: Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g. Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.

Title: Biases in Large Language Model-Elicited Text: A Case Study in Natural Language Inference

Authors: Grace Proebsting, Adam Poliak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05047
Pdf URL: https://arxiv.org/pdf/2503.05047
Copy Paste: [[2503.05047]] Biases in Large Language Model-Elicited Text: A Case Study in Natural Language Inference(https://arxiv.org/abs/2503.05047)
Keywords: language model, gpt, llm, chat
Abstract: We test whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases like NLP datasets elicited from crowd-source workers. We recreate a portion of the Stanford Natural Language Inference corpus using GPT-4, Llama-2 70b for Chat, and Mistral 7b Instruct. We train hypothesis-only classifiers to determine whether LLM-elicited NLI datasets contain annotation artifacts. Next, we use pointwise mutual information to identify the words in each dataset that are associated with gender, race, and age-related terms. On our LLM-generated NLI datasets, fine-tuned BERT hypothesis-only classifiers achieve between 86% and 96% accuracy. Our analyses further characterize the annotation artifacts and stereotypical biases in LLM-generated datasets.
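The pointwise mutual information step mentioned in the abstract above can be sketched as follows. This assumes sentence-level co-occurrence counts and whitespace tokenization, which the abstract does not specify; the toy sentences and the function name are invented for illustration and are not drawn from the paper's data. PMI is computed as log2(P(word, term) / (P(word) P(term))).

```python
import math
from collections import Counter


def pmi_with_term(sentences, term):
    """PMI between `term` and every word that co-occurs with it in a sentence."""
    n = len(sentences)
    word_count = Counter()    # sentences containing each word
    joint_count = Counter()   # sentences containing both the word and the term
    term_count = 0            # sentences containing the term
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        has_term = term in tokens
        term_count += int(has_term)
        for token in tokens:
            if token == term:
                continue
            word_count[token] += 1
            if has_term:
                joint_count[token] += 1
    # P(w, t) / (P(w) * P(t)) simplifies to joint * n / (count_w * count_t).
    return {
        token: math.log2((joint * n) / (word_count[token] * term_count))
        for token, joint in joint_count.items()
    }


if __name__ == "__main__":
    toy_hypotheses = [                      # invented examples
        "the woman is reading",
        "the woman is reading a book",
        "the man is driving",
        "the man is reading",
    ]
    for word, score in sorted(pmi_with_term(toy_hypotheses, "woman").items()):
        print(word, round(score, 3))
```

Words with positive PMI appear with the term more often than chance would predict; a score near zero (as for a function word that appears everywhere) signals no association.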

Title: Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets

Authors: Preetam Prabhu Srikar Dammu, Himanshu Naidu, Chirag Shah
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05049
Pdf URL: https://arxiv.org/pdf/2503.05049
Copy Paste: [[2503.05049]] Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets(https://arxiv.org/abs/2503.05049)
Keywords: language model, llm
Abstract: As question answering (QA) systems advance alongside the rapid evolution of foundation models, the need for robust, adaptable, and large-scale evaluation benchmarks becomes increasingly critical. Traditional QA benchmarks are often static and publicly available, making them susceptible to data contamination and memorization by large language models (LLMs). Consequently, static benchmarks may overestimate model generalization and hinder a reliable assessment of real-world performance. In this work, we introduce Dynamic-KGQA, a scalable framework for generating adaptive QA datasets from knowledge graphs (KGs), designed to mitigate memorization risks while maintaining statistical consistency across iterations. Unlike fixed benchmarks, Dynamic-KGQA generates a new dataset variant on every run while preserving the underlying distribution, enabling fair and reproducible evaluations. Furthermore, our framework provides fine-grained control over dataset characteristics, supporting domain-specific and topic-focused QA dataset generation. Additionally, Dynamic-KGQA produces compact, semantically coherent subgraphs that facilitate both training and evaluation of KGQA models, enhancing their ability to leverage structured knowledge effectively. To align with existing evaluation protocols, we also provide static large-scale train/test/validation splits, ensuring comparability with prior methods. By introducing a dynamic, customizable benchmarking paradigm, Dynamic-KGQA enables a more rigorous and adaptable evaluation of QA systems.

Title: A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLMs

Authors: Melkamu Abay Mersha, Mesay Gemeda Yigezu, Hassan shakil, Ali Al shami, Sanghyun Byun, Jugal Kalita
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05050
Pdf URL: https://arxiv.org/pdf/2503.05050
Copy Paste: [[2503.05050]] A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLMs(https://arxiv.org/abs/2503.05050)
Keywords: llm
Abstract: The increasing complexity of LLMs presents significant challenges to their transparency and interpretability, necessitating the use of eXplainable AI (XAI) techniques to enhance trustworthiness and usability. This study introduces a comprehensive evaluation framework with four novel metrics for assessing the effectiveness of five XAI techniques across five LLMs and two downstream tasks. We apply this framework to evaluate several XAI techniques, namely LIME, SHAP, Integrated Gradients, Layer-wise Relevance Propagation (LRP), and Attention Mechanism Visualization (AMV), using the IMDB Movie Reviews and Tweet Sentiment Extraction datasets. The evaluation focuses on four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. Our results show that LIME consistently achieves high scores across multiple LLMs and evaluation metrics, while AMV demonstrates superior Robustness and near-perfect Consistency. LRP excels in Contrastivity, particularly with more complex models. Our findings provide valuable insights into the strengths and limitations of different XAI methods, offering guidance for developing and selecting appropriate XAI techniques for LLMs.

Title: ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports

Authors: Yosuke Yamagishi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05060
Pdf URL: https://arxiv.org/pdf/2503.05060
Copy Paste: [[2503.05060]] ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports(https://arxiv.org/abs/2503.05060)
Keywords: language model
Abstract: Objective: This study aims to evaluate and compare the performance of two Japanese language models, conventional Bidirectional Encoder Representations from Transformers (BERT) and the newer ModernBERT, in classifying findings from chest CT reports, with a focus on tokenization efficiency, processing time, and classification performance. Methods: We conducted a retrospective study using the CT-RATE-JPN dataset containing 22,778 training reports and 150 test reports. Both models were fine-tuned for multi-label classification of 18 common chest CT conditions. The training data was split 18,222:4,556 for training and validation. Performance was evaluated using F1 scores for each condition and exact match accuracy across all 18 labels. Results: ModernBERT demonstrated superior tokenization efficiency, requiring 24.0% fewer tokens per document (258.1 vs. 339.6) compared to BERT Base. This translated to significant performance improvements, with ModernBERT completing training in 1877.67 seconds versus BERT's 3090.54 seconds (39% reduction). ModernBERT processed 38.82 samples per second during training (1.65x faster) and 139.90 samples per second during inference (1.66x faster). Despite these efficiency gains, classification performance remained comparable, with ModernBERT achieving superior F1 scores in 8 conditions, while BERT performed better in 4 conditions. Overall exact match accuracy was slightly higher for ModernBERT (74.67% vs. 72.67%), though this difference was not statistically significant (p=0.6291). Conclusion: ModernBERT offers substantial improvements in tokenization efficiency and training speed without sacrificing classification performance. These results suggest that ModernBERT is a promising candidate for clinical applications in Japanese radiology reports analysis.
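The efficiency figures quoted in the abstract above can be re-derived from the raw numbers it reports, and exact match accuracy is simple to state in code. The sketch below is a consistency check, not the authors' evaluation script. The report counts 112 and 109 out of 150 are inferred here from the stated percentages and are not given in the abstract, and the helper names are hypothetical.

```python
def relative_reduction(new, old):
    """Fractional reduction of `new` relative to `old`."""
    return (old - new) / old


def exact_match_accuracy(y_true, y_pred):
    """Share of reports whose full label vector is predicted without any error."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Figures taken from the abstract.
token_reduction = relative_reduction(258.1, 339.6)      # tokens per document
time_reduction = relative_reduction(1877.67, 3090.54)   # training seconds
training_time_ratio = 3090.54 / 1877.67                 # BERT time / ModernBERT time

# Inferred (assumption): whole-report counts that reproduce the stated percentages.
modernbert_exact = 112 / 150
bert_exact = 109 / 150

if __name__ == "__main__":
    print(f"token reduction:     {token_reduction:.1%}")
    print(f"training time cut:   {time_reduction:.1%}")
    print(f"training time ratio: {training_time_ratio:.2f}x")
    print(f"exact match:         {modernbert_exact:.2%} vs. {bert_exact:.2%}")
    # Toy multi-label example with 3 labels per report (invented data).
    print(exact_match_accuracy([[1, 0, 1], [0, 0, 1]], [[1, 0, 1], [0, 1, 1]]))
```

Because exact match requires all 18 labels of a report to be correct at once, it is a much stricter measure than per-condition F1, which helps explain why a three-report difference on a 150-report test set is not statistically significant.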

Title: No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

Authors: Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05061
Pdf URL: https://arxiv.org/pdf/2503.05061
Copy Paste: [[2503.05061]] No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding(https://arxiv.org/abs/2503.05061)
Keywords: language model, gpt, llm
Abstract: LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and grade responses to that question. Although aggregate level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis on how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g. Qwen 2.5 7B) with higher quality references reaches better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.

Title: S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information

Authors: Feng Jiang, Zhiyu Lin, Fan Bu, Yuhao Du, Benyou Wang, Haizhou Li
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.05085
Pdf URL: https://arxiv.org/pdf/2503.05085
Copy Paste: [[2503.05085]] S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information(https://arxiv.org/abs/2503.05085)
Keywords: language model, gpt, llm
Abstract: The rapid development of large language models (LLMs) has brought significant attention to speech models, particularly recent progress in speech2speech protocols supporting speech input and output. However, existing benchmarks, which adopt automatic text-based evaluators to assess the instruction-following ability of these models, lack consideration of paralinguistic information in both speech understanding and generation. To address these issues, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information in both speech-in and speech-out across real-world tasks. We design 154 samples that fuse TTS and live recordings in four domains with 21 tasks and manually evaluate existing popular speech models in an arena-style manner. The experimental results show that: (1) in addition to the superior performance of GPT-4o, the speech model of cascaded ASR, LLM, and TTS outperforms the jointly trained model after text-speech alignment in speech2speech protocols; (2) considering paralinguistic information, the knowledgeability of the speech model mainly depends on the LLM backbone, and its multilingual support is limited by the speech module; (3) excellent speech models can already understand the paralinguistic information in speech input, but generating appropriate audio with paralinguistic information is still a challenge.

Title: SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding

Authors: Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05096
Pdf URL: https://arxiv.org/pdf/2503.05096
Copy Paste: [[2503.05096]] SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding(https://arxiv.org/abs/2503.05096)
Keywords: language model, llm
Abstract: Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to varying workloads and system environments, resulting in performance variability and SLO violations. In this paper, we introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. SpecServe proposes a theoretical model to understand and predict the efficiency of speculative decoding across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to guarantee optimal performance while achieving high SLO attainment. Experimental results on real-world LLM traces demonstrate that SpecServe consistently meets SLOs and achieves substantial performance improvements, yielding 1.14$\times$-14.3$\times$ speedups over state-of-the-art speculative inference systems.
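The draft-then-verify idea behind speculative decoding, as described in the abstract above, can be sketched in its simplest greedy form. This is not SpecServe's adaptive algorithm: the speculation length k is fixed, and both "models" are hypothetical deterministic toy functions over integer tokens. The point it illustrates is that the output matches plain greedy decoding with the target model while needing fewer verification rounds.

```python
def target_next(tokens):
    """Hypothetical large model: greedy next token from the context."""
    return sum(tokens) % 7


def draft_next(tokens):
    """Hypothetical small model: agrees with the target unless the last token is 3."""
    guess = sum(tokens) % 7
    return (guess + 1) % 7 if tokens[-1] == 3 else guess


def greedy_decode(prompt, num_new):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, k=4):
    """Greedy speculative decoding with a fixed draft length k."""
    tokens = list(prompt)
    verify_rounds = 0
    while len(tokens) - len(prompt) < num_new:
        # Draft phase: the small model proposes k tokens autoregressively.
        context, draft = list(tokens), []
        for _ in range(k):
            proposal = draft_next(context)
            draft.append(proposal)
            context.append(proposal)
        # Verify phase: conceptually one batched pass of the large model.
        verify_rounds += 1
        accepted = []
        for proposal in draft:
            expected = target_next(tokens + accepted)
            if proposal == expected:
                accepted.append(proposal)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # All drafts accepted: the verify pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + num_new], verify_rounds


if __name__ == "__main__":
    output, rounds = speculative_decode([1, 2], num_new=20)
    print(output == greedy_decode([1, 2], 20), rounds)
```

How much is saved depends on how often the draft model agrees with the target and on the choice of k, which is the quantity an adaptive system would tune to the current load.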

Title: RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

Authors: Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05142
Pdf URL: https://arxiv.org/pdf/2503.05142
Copy Paste: [[2503.05142]] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist(https://arxiv.org/abs/2503.05142)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, utilizing a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expenses, concerns regarding privacy and security, and reproducibility. In this paper, we propose a straightforward, replicable, and accurate automated evaluation method by leveraging a lightweight LLM as the judge, named RocketEval. Initially, we identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as a multi-faceted Q&A using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs is largely attributable to high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, which is designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with the supervised annotations. Our experiments carried out on the automated evaluation benchmarks, MT-Bench and WildBench datasets, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at this https URL.

Title: Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History

Authors: Bowen Wu, Wenqing Wang, Haoran Li, Ying Li, Jingsong Yu, Baoxun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05150
Pdf URL: https://arxiv.org/pdf/2503.05150
Copy Paste: [[2503.05150]] Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History(https://arxiv.org/abs/2503.05150)
Keywords: chat, retrieval augmented generation, agent
Abstract: Proactive dialogue systems aim to empower chatbots with the capability of leading conversations towards specific targets, thereby enhancing user engagement and service autonomy. Existing systems typically target pre-defined keywords or entities, neglecting user attributes and preferences implicit in dialogue history, hindering the development of long-term user intimacy. To address these challenges, we take a radical step towards building a more human-like conversational agent by integrating proactive dialogue systems with long-term memory into a unified framework. Specifically, we define a novel task named Memory-aware Proactive Dialogue (MapDia). By decomposing the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive Dataset (ChMapData). Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation, designed to steer dialogues towards relevant historical topics at the right time. The effectiveness of our dataset and models is validated through both automatic and human evaluations. We release the open-source framework and dataset at this https URL.
摘要：主动的对话系统旨在使聊天机器人能够将对话带入特定目标的能力，从而增强用户参与度和服务自主权。现有系统通常针对预定义的关键字或实体，忽略用户属性和对话历史上隐含的偏好，从而阻碍了长期用户亲密关系的发展。为了应对这些挑战，我们通过将主动对话系统与长期记忆整合到统一的框架中，朝着建立更类似人类的对话代理人迈出了根本性的一步。具体来说，我们定义了一个名为“内存意识到主动对话”（MAPDIA）的新任务。通过分解任务，我们提出了一种自动数据构建方法，并创建第一个中文记忆意识的主动数据集（CHMAPDATA）。此外，我们引入了一个基于检索增强发电（RAG）的联合框架，其中包含三个模块：主题摘要，主题检索和主动主题转换检测和生成，旨在在适当的时间引导对话到相关的历史主题。我们的数据集和模型的有效性通过自动和人类评估得到验证。我们在此HTTPS URL上发布开源框架和数据集。

Title: Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy

Authors: Ruixi Lin, Ziqiao Wang, Yang You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05157
Pdf URL: https://arxiv.org/pdf/2503.05157
Copy Paste: [[2503.05157]] Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy(https://arxiv.org/abs/2503.05157)
Keywords: language model, llm, prompt
Abstract: Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a post-hoc debiasing method based on nonlinear integer programming that ensembles weight correction and membership correction to enable flexible rectification of class probabilities at both class and sample levels, enhancing the performance of LLMs directly from their outputs. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. The resulting probability correction scheme demonstrates that sample-level corrections are necessary to elevate weak classes. In addition, because it effectively corrects weak classes, our method also brings significant performance gains to Llama-2-70B, especially on a biomedical domain task, demonstrating its effectiveness across both small and large model variants.
摘要：语言模型是很强的学习者，并且在文本分类任务中实现了良好的总体准确性，掩盖了他们的结果遭受较高的班级准确性不平衡的事实。我们认为，对整体准确性的追求不应来自丰富强大的阶级，而应来自较弱的阶层。为了解决不平衡，我们提出了基于事后的非线性整数基于分解方法，该方法结合了权重校正和成员校正，以使类别和样本水平的类别概率的灵活矫正构成直接从其输出中提高LLM的性能。对七个文本分类基准的Llama-2-13b进行评估表明，我们的方法可以通过平衡的班级准确性实现最先进的总体准确性提高。所得的概率校正方案表明，样品级校正对于提升弱类是必要的。此外，由于有效纠正弱类，我们的方法还为Llama-2-70B带来了显着的性能增长，尤其是在生物医学领域任务上，这表明了其在小型和大型模型变体中的有效性。

Title: Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Authors: Simon A. Aytes, Jinheon Baek, Sung Ju Hwang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05179
Pdf URL: https://arxiv.org/pdf/2503.05179
Copy Paste: [[2503.05179]] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching(https://arxiv.org/abs/2503.05179)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models have demonstrated remarkable reasoning capabilities through Chain of Thought (CoT) prompting, but often at the cost of excessive verbosity in their intermediate outputs, which increases computational overhead. We introduce Sketch-of-Thought (SoT), a novel prompting framework that combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage while preserving reasoning accuracy. SoT is designed as a flexible framework that can incorporate any custom reasoning paradigms based on cognitive science, and we instantiate it with three such paradigms - Conceptual Chaining, Chunked Symbolism, and Expert Lexicons - each tailored to different reasoning tasks and selected dynamically via a lightweight routing model. Through comprehensive evaluation across 15 reasoning datasets with multiple languages and multimodal scenarios, we demonstrate that SoT achieves token reductions of 76% with negligible accuracy impact. In certain domains like mathematical and multi-hop reasoning, it even improves accuracy while using significantly fewer tokens. Our code is publicly available: this https URL.
摘要：大型语言模型的最新进展表现出了通过思想链（COT）提示的显着推理能力，但通常以中间输出过度详细的陈述，这增加了计算开销。我们介绍了素描（SOT），这是一个新颖的提示框架，将认知启发的推理范式与语言约束结合在一起，以最大程度地减少令牌用法，同时保持推理精度。 SOT被设计为一个灵活的框架，可以通过认知科学结合任何自定义的推理范式，我们将其实例化，以三个这样的范式实例化 - 概念链接，块状象征和专家词典 - 每个范式都适合不同的推理任务，并通过轻量级的路由模型动态选择。通过对具有多种语言和多模式场景的15个推理数据集进行全面评估，我们证明了SOT可实现76％的代币减少，而准确性可忽略不计。在某些域（例如数学和多跳推理）中，它甚至可以提高准确性，同时使用明显更少的令牌。我们的代码公开可用：此HTTPS URL。

Title: Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning

Authors: Jiachun Li, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05188
Pdf URL: https://arxiv.org/pdf/2503.05188
Copy Paste: [[2503.05188]] Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning(https://arxiv.org/abs/2503.05188)
Keywords: llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting demonstrates varying performance under different reasoning tasks. Previous work attempts to evaluate it but falls short in providing an in-depth analysis of patterns that influence the CoT. In this paper, we study the CoT performance from the perspective of effectiveness and faithfulness. For the former, we identify key factors that influence CoT effectiveness on performance improvement, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue by conducting a joint analysis of the information interaction among the question, CoT, and answer. The result demonstrates that, when the LLM predicts answers, it can recall correct information missing in the CoT from the question, leading to the problem. Finally, we propose a novel algorithm to mitigate this issue, in which we recall extra information from the question to enhance the CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.
摘要：促进链（COT）提示在不同的推理任务下表现出不同的性能。先前的工作试图对其进行评估，但在对影响COT的模式的深入分析方面缺乏。在本文中，我们从有效性和忠诚的角度研究了COT的性能。对于前者，我们确定影响COT有效性改善绩效的关键因素，包括问题困难，信息增益和信息流。对于后者，我们通过对问题，COT和答案之间的信息相互作用进行联合分析来解释不忠的COT问题。结果表明，当LLM预测答案时，它可以回忆起COT中缺少的问题，从而导致问题。最后，我们提出了一种新颖的算法来减轻此问题，其中我们回想起问题中的额外信息，以增强COT的生成并根据其信息增益来评估COTS。广泛的实验表明，我们的方法增强了COT的忠诚和有效性。

Title: Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning

Authors: Mufan Xu, Gewen Liang, Kehai Chen, Wei Wang, Xun Zhou, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05193
Pdf URL: https://arxiv.org/pdf/2503.05193
Copy Paste: [[2503.05193]] Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning(https://arxiv.org/abs/2503.05193)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable performance on knowledge graph question answering (KGQA) tasks by planning and interacting with knowledge graphs. However, existing methods often confuse tool utilization with knowledge reasoning, harming readability of model outputs and giving rise to hallucinatory tool invocations, which hinder the advancement of KGQA. To address this issue, we propose Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning (MemQ) to decouple the LLM from tool invocation tasks using LLM-built query memory. By establishing a memory module with explicit descriptions of query statements, the proposed MemQ facilitates the KGQA process with natural language reasoning and memory-augmented query reconstruction. Meanwhile, we design an effective and readable reasoning strategy to enhance the LLM's reasoning capability in KGQA. Experimental results show that MemQ achieves state-of-the-art performance on the widely used benchmarks WebQSP and CWQ.
摘要：大型语言模型（LLM）通过计划和与知识图进行互动，在知识图答案（KGQA）任务上取得了出色的性能。但是，现有方法通常将工具利用与知识推理混淆，损害模型输出的可读性并引起幻觉工具调用，这阻碍了KGQA的进步。为了解决此问题，我们建议使用LLM-Built查询内存中的基于LLM的知识图（MEMQ）的内存调查重建（MEMQ）从工具调用任务中解除LLM。通过使用明确的查询语句描述建立一个内存模块，提出的MEMQ通过自然语言推理和记忆启动的查询重建促进了KGQA过程。同时，我们设计了一种有效且可读性的推理，以增强LLM在KGQA中的推理能力。 MEMQ在广泛使用基准WebQSP和CWQ上实现最新性能的实验结果。

Title: ORANSight-2.0: Foundational LLMs for O-RAN

Authors: Pranshav Gajjar, Vijay K. Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05200
Pdf URL: https://arxiv.org/pdf/2503.05200
Copy Paste: [[2503.05200]] ORANSight-2.0: Foundational LLMs for O-RAN(https://arxiv.org/abs/2503.05200)
Keywords: language model, gpt, llm, chat, retrieval-augmented generation, agent
Abstract: Despite the transformative impact of Large Language Models (LLMs) across critical domains such as healthcare, customer service, and business marketing, their integration into Open Radio Access Networks (O-RAN) remains limited. This gap is primarily due to the absence of domain-specific foundational models, with existing solutions often relying on general-purpose LLMs that fail to address the unique challenges and technical intricacies of O-RAN. To bridge this gap, we introduce ORANSight-2.0 (O-RAN Insights), a pioneering initiative aimed at developing specialized foundational LLMs tailored for O-RAN. Built on 18 LLMs spanning five open-source LLM frameworks, ORANSight-2.0 fine-tunes models ranging from 1 to 70B parameters, significantly reducing reliance on proprietary, closed-source models while enhancing performance for O-RAN. At the core of ORANSight-2.0 is RANSTRUCT, a novel Retrieval-Augmented Generation (RAG) based instruction-tuning framework that employs two LLM agents to create high-quality instruction-tuning datasets. The generated dataset is then used to fine-tune the 18 pre-trained open-source LLMs via QLoRA. To evaluate ORANSight-2.0, we introduce srsRANBench, a novel benchmark designed for code generation and codebase understanding in the context of srsRAN, a widely used 5G O-RAN stack. We also leverage ORANBench13K, an existing benchmark for assessing O-RAN-specific knowledge. Our comprehensive evaluations demonstrate that ORANSight-2.0 models outperform general-purpose and closed-source models, such as ChatGPT-4o and Gemini, by 5.421% on ORANBench and 18.465% on srsRANBench, achieving superior performance while maintaining lower computational and energy costs. We also experiment with RAG-augmented variants of ORANSight-2.0 LLMs and thoroughly evaluate their energy characteristics, demonstrating costs for training, standard inference, and RAG-augmented inference.
摘要：尽管大语言模型（LLM）在医疗保健，客户服务和业务营销等关键领域的变革性影响仍然有限。这一差距主要是由于缺乏特定领域的基础模型，现有的解决方案通常依赖于通用LLM，这些LLM无法解决O-Ran的独特挑战和技术复杂性。为了弥合这一差距，我们介绍了Oransight-2.0（O-Ran Insights），这是一项旨在开发专门针对O-Ran量身定制的专业基础LLM的开拓性计划。 Oransight-2.0微型型号构建了18个开源LLM框架，建于18个开源LLM框架，范围从1到70B参数，大大降低了对专有的，封闭式模型的依赖，同时增强了O-RAN的性能。 Oransight-2.0的核心是Rastruct，这是一种基于新颖的检索演示生成（RAG）指令调节框架，该框架使用两种LLM代理来创建高质量的指令调节数据集。然后，生成的数据集用于通过Qlora微调18个预训练的开源LLM。为了评估Oransight-2.0，我们介绍了Srsranbench，这是一种新颖的基准，旨在在SRSRAN的背景下为代码生成和代码库理解，这是一种广泛使用的5G O-Ran堆栈。我们还利用Oranbench13k（用于评估O-RAN特定知识的现有基准）。我们的全面评估表明，Oransight-2.0模型在Oranbench上的表现优于Chatgpt-4O和Gemini等封闭式模型，例如Chatgpt-4O和Gemini，在Srsranbench上的表现为5.421％，同时达到了卓越的性能，同时保持了较低的计算和能源成本。我们还尝试了Oransight-2.0 LLM的抹布式变体，并彻底评估其能量特征，证明了训练，标准推理和抹布的推理成本。

Title: Knowledge Updating? No More Model Editing! Just Selective Contextual Reasoning

Authors: Guoxiu He, Xin Song, Aixin Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05212
Pdf URL: https://arxiv.org/pdf/2503.05212
Copy Paste: [[2503.05212]] Knowledge Updating? No More Model Editing! Just Selective Contextual Reasoning(https://arxiv.org/abs/2503.05212)
Keywords: language model, llm
Abstract: As real-world knowledge evolves, the information embedded within large language models (LLMs) can become outdated, inadequate, or erroneous. Model editing has emerged as a prominent approach for updating LLMs' knowledge with minimal computational costs and parameter changes. This approach typically identifies and adjusts specific model parameters associated with newly acquired knowledge. However, existing methods often underestimate the adverse effects that parameter modifications can have on broadly distributed knowledge. More critically, post-edit LLMs frequently struggle with multi-hop reasoning and continuous knowledge updates. Although various studies have discussed these shortcomings, there is a lack of comprehensive evaluation. In this paper, we provide an evaluation of ten model editing methods along four dimensions: reliability, generalization, locality, and portability. Results confirm that all ten popular model editing methods show significant shortcomings across multiple dimensions, suggesting model editing is less promising. We then propose a straightforward method called Selective Contextual Reasoning (SCR), for knowledge updating. SCR does not modify model parameters but harnesses LLM's inherent contextual reasoning capabilities utilizing the updated knowledge pieces. Under SCR, an LLM first assesses whether an incoming query falls within the scope of an external knowledge base. If it does, the relevant external knowledge texts are contextualized to enhance reasoning; otherwise, the query is answered directly. We evaluate SCR against the ten model editing methods on two counterfactual datasets with three backbone LLMs. Empirical results confirm the effectiveness and efficiency of contextual reasoning for knowledge updating.
摘要：随着现实知识的发展，嵌入在大语言模型（LLM）中的信息可能会过时，不足或错误。模型编辑已成为一种突出的方法，用于以最小的计算成本和参数更改来更新LLMS知识。这种方法通常会识别并调整与新获得知识相关的特定模型参数。但是，现有方法通常低估了参数修改可能对广泛分布的知识产生的不利影响。更重要的是，后编辑LLMS经常在多跳的推理和持续的知识更新中挣扎。尽管各种研究都讨论了这些缺点，但缺乏全面的评估。在本文中，我们对沿四个维度的十种模型编辑方法进行评估：可靠性，概括，局部性和便携性。结果证实，所有十种流行的模型编辑方法都显示出在多个维度之间存在重大缺点，这表明模型编辑不太有希望。然后，我们提出了一种简单的方法，称为选择性上下文推理（SCR），以进行知识更新。 SCR不修改模型参数，而是利用LLM使用更新的知识作品的固有上下文推理功能。在SCR下，LLM首先评估传入查询是否属于外部知识库的范围。如果确实如此，相关的外部知识文本将被上下文化以增强推理；否则，查询将直接回答。我们根据两个具有三个骨干LLM的反事实数据集上的十种模型编辑方法评估SCR。经验结果证实了知识更新的上下文推理的有效性和效率。
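
The routing step described in the SCR abstract above (check whether a query falls within the scope of an external knowledge base, then either place the relevant text in the context or answer directly) can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the paper the LLM itself assesses scope, whereas here a crude word-overlap check stands in for it. The threshold and the names `tokenize`, `in_scope_facts`, and `build_prompt` are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a real retriever or LLM check."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def in_scope_facts(query, knowledge_base, threshold=0.5):
    """Return facts whose word overlap with the query meets the threshold.

    Hypothetical scope check: overlap = |query ∩ fact| / |query|.
    """
    q = tokenize(query)
    if not q:
        return []
    return [fact for fact in knowledge_base
            if len(q & tokenize(fact)) / len(q) >= threshold]

def build_prompt(query, knowledge_base):
    """Add updated knowledge to the prompt if the query is in scope; else pass through."""
    facts = in_scope_facts(query, knowledge_base)
    if facts:
        context = "\n".join(facts)
        return f"Context:\n{context}\n\nQuestion: {query}"
    return f"Question: {query}"

kb = ["The capital of Atlantis is Poseidonia."]  # made-up counterfactual fact
print(build_prompt("What is the capital of Atlantis?", kb))  # in scope: context added
print(build_prompt("Who wrote Hamlet?", kb))                 # out of scope: answered directly
```

The model parameters are never touched in this design; updating knowledge only means editing the entries in `kb`.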

Title: Personalized Text Generation with Contrastive Activation Steering

Authors: Jinghao Zhang, Yuting Liu, Wenjie Wang, Qiang Liu, Shu Wu, Liang Wang, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05213
Pdf URL: https://arxiv.org/pdf/2503.05213
Copy Paste: [[2503.05213]] Personalized Text Generation with Contrastive Activation Steering(https://arxiv.org/abs/2503.05213)
Keywords: llm, retrieval-augmented generation
Abstract: Personalized text generation aims to infer users' writing style preferences from their historical texts and generate outputs that faithfully reflect these stylistic characteristics. Existing solutions primarily adopt two paradigms: retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (PEFT). While these approaches have advanced the field, they suffer from two critical limitations: (1) the entanglement of content semantics and stylistic patterns in historical texts impedes accurate modeling of user-specific writing preferences; and (2) scalability challenges arising from both the inference latency that RAG incurs through retrieval operations and the per-user parameter storage that PEFT requires. To overcome these limitations, we propose StyleVector, a training-free framework that disentangles and represents personalized writing style as a vector in the LLM's activation space, enabling style-steered generation during inference without requiring costly retrieval or parameter storage. Comprehensive experiments demonstrate that our framework achieves a significant 8% relative improvement in personalized generation while reducing storage requirements by a factor of 1700 compared with the PEFT method.
摘要：个性化的文本生成旨在从其历史文本中推断出用户的写作风格偏好，并产生忠实地反映这些风格特征的输出。现有的解决方案主要采用两个范式：检索增强生成（RAG）和参数有效的微调（PEFT）。尽管这些方法已推进了该领域，但它们遭受了两个关键局限性：（1）历史文本中内容语义和风格模式的纠缠阻碍了用户特定的写作偏好的准确建模；（2）通过检索操作和PEFT的参数存储要求每个用户模型引起的挑战。为了克服这些局限性，我们提出了stylevector，这是一个无训练的框架，它是LLM的激活空间中的矢量，将个性化的写作样式代表了个性化的写作样式，从而在推理过程中启用了样式稳定的生成，而无需昂贵的检索或参数存储。全面的实验表明，我们的框架可实现个性化生成的8％相对改善，而PEFT方法将存储要求降低了1700倍。
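
One way to picture style steering in activation space is to estimate a style vector as the mean difference between hidden activations for a user's own texts and for style-neutral versions of the same content, then add a scaled copy of that vector to a hidden state at inference. The sketch below uses toy 3-dimensional lists in place of real LLM activations. The mean-difference estimate, the scaling factor `alpha`, and all function names are assumptions for illustration; the abstract does not confirm these details.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def style_vector(user_acts, neutral_acts):
    """Contrastive estimate: mean(user activations) - mean(neutral activations)."""
    mu_user = mean_vector(user_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [u - n for u, n in zip(mu_user, mu_neutral)]

def steer(hidden, style, alpha=1.0):
    """Shift a hidden state along the style direction: h' = h + alpha * v."""
    return [h + alpha * s for h, s in zip(hidden, style)]

# Toy activations (hypothetical values, not taken from any model).
user_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neutral_acts = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

v = style_vector(user_acts, neutral_acts)
print(v)                                      # [1.0, 2.0, 0.0]
print(steer([0.0, 0.0, 1.0], v, alpha=0.5))   # [0.5, 1.0, 1.0]
```

Under this reading, each user is represented by a single vector, which is consistent with the storage savings over per-user PEFT parameters that the abstract reports.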

Title: MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio

Authors: Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, Mengyue Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05242
Pdf URL: https://arxiv.org/pdf/2503.05242
Copy Paste: [[2503.05242]] MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio(https://arxiv.org/abs/2503.05242)
Keywords: language model, llm, agent
Abstract: The rapid advancement of large language models (LLMs) and artificial intelligence-generated content (AIGC) has accelerated AI-native applications, such as AI-based storybooks that automate engaging story production for children. However, challenges remain in improving story attractiveness, enriching storytelling expressiveness, and developing open-source evaluation benchmarks and frameworks. Therefore, we propose and open-source MM-StoryAgent, which creates immersive narrated video storybooks with refined plots, role-consistent images, and multi-channel audio. MM-StoryAgent designs a multi-agent framework that employs LLMs and diverse expert tools (generative models and APIs) across several modalities to produce expressive storytelling videos. The framework enhances story attractiveness through a multi-stage writing pipeline. In addition, it improves the immersive storytelling experience by integrating sound effects with visual, music and narrative assets. MM-StoryAgent offers a flexible, open-source platform for further development, where generative modules can be substituted. Both objective and subjective evaluations of textual story quality and of alignment between modalities validate the effectiveness of our proposed MM-StoryAgent system. The demo and source code are available.
摘要：大型语言模型（LLM）和人工智能生成的内容（AIGC）的快速发展已经加速了AI-NENATIANG应用程序，例如基于AI的故事书，可以使儿童引人入胜。然而，在改善故事吸引力，丰富讲故事的表现力以及开发开源评估基准和框架方面仍然存在挑战。因此，我们提出和开放式MM模式，该元素层，它通过精致的绘图，符合角色的图像和多频道音频创建了沉浸式的叙述性视频故事书。 MM storyagent设计了一个多代理框架，该框架在几种模式中采用LLM和多样化的专家工具（生成模型和API）来制作富有表现力的讲故事视频。该框架通过多阶段的写作管道增强了故事的吸引力。此外，它通过将声音效果与视觉，音乐和叙事资产整合在一起来改善身临其境的讲故事经验。 MM-STORYAGENT提供了一个灵活的开源平台，用于进一步开发，可以在其中替换生成模块。关于文本故事质量和模式之间的客观评估和主观评估都证明了我们提出的MM层面系统的有效性。可用演示和源代码。

Title: ZOGRASCOPE: A New Benchmark for Property Graphs

Authors: Francesco Cazzaro, Justin Kleindienst, Sofia Marquez, Ariadna Quattoni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05268
Pdf URL: https://arxiv.org/pdf/2503.05268
Copy Paste: [[2503.05268]] ZOGRASCOPE: A New Benchmark for Property Graphs(https://arxiv.org/abs/2503.05268)
Keywords: llm, prompt
Abstract: Natural language interfaces to knowledge graphs have become increasingly important in recent years, enabling easy and efficient access to structured data. In particular, property graphs have seen growing adoption. However, these kinds of graphs remain relatively underrepresented in research, which has focused in large part on RDF-style graphs. In fact, there is a lack of resources for evaluating systems on property graphs, with many existing datasets featuring relatively simple queries. To address this gap, we introduce ZOGRASCOPE, a benchmark designed specifically for the Cypher query language. The benchmark includes a diverse set of manually annotated queries of varying complexity. We complement this paper with a set of experiments that test the performance of out-of-the-box LLMs of different sizes. Our experiments show that semantic parsing over graphs is still a challenging open problem that cannot be solved by prompting LLMs alone.
摘要：近年来，与知识图的自然语言界面变得越来越重要，从而可以轻松有效地访问结构化数据。特别是在财产图中，采用的采用日益增加。但是，这类图在研究中的代表性相对不足，这很大程度上集中在RDF风格的图上。实际上，缺乏用于评估属性图上系统的资源，许多现有数据集都具有相对简单的查询。为了解决这一差距，我们介绍了Zograscope，这是一种专门为Cypher查询语言设计的基准测试。基准包括一组各种复杂性的手动注释的查询。我们通过一组测试不同尺寸的开箱即用LLM的性能的实验对本文进行补充。我们的实验表明，图形上的语义解析仍然是一个具有挑战性的开放问题，仅通过提示LLMS就无法解决。

Title: Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing

Authors: Neemesh Yadav, Jiarui Liu, Francesco Ortu, Roya Ensafi, Zhijing Jin, Rada Mihalcea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05280
Pdf URL: https://arxiv.org/pdf/2503.05280
Copy Paste: [[2503.05280]] Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing(https://arxiv.org/abs/2503.05280)
Keywords: llm
Abstract: The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at this https URL.
摘要：自然语言处理（NLP）方法将文本分类为多个类的能力促使它们在在线内容审核任务中的使用，例如仇恨言论和虚假新闻检测。但是，人们对这些方法的做出决定或为什么首先调节某些内容的理解有限。为了调查内容节制背后的隐藏机制，我们探讨了多个方向：1）培训分类器以反向工程师的内容调节决策； 2）通过分析沙普利价值观和LLM指导的解释来解释内容中等决策。我们的主要重点是使用Twitter流的先前存在的CORPORA，在跨国家做出的内容审核决定。我们的实验揭示了跨国和随着时间的审查帖子中有趣的模式。通过对LLM生成的三个LLM的解释的人体评估，我们评估了使用LLMS在内容节制中的有效性。最后，我们讨论了潜在的未来方向以及这项工作的局限性和道德考虑。我们的代码和数据可在此HTTPS URL上找到

Title: Similarity-Based Domain Adaptation with LLMs

Authors: Jie He, Wendi Zhou, Xiang Lorraine Li, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05281
Pdf URL: https://arxiv.org/pdf/2503.05281
Copy Paste: [[2503.05281]] Similarity-Based Domain Adaptation with LLMs(https://arxiv.org/abs/2503.05281)
Keywords: language model, llm
Abstract: Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize onto unlabeled target data. Prior research has primarily focused on learning domain-invariant features across the source and target domains. However, these methods often require training a model using source domain data, which is time-consuming and can limit model usage for applications with different source data. This paper introduces a simple framework that utilizes the impressive generalization capabilities of Large Language Models (LLMs) for target data annotation without the need for source model training, followed by a novel similarity-based knowledge distillation loss. Our extensive experiments on cross-domain text classification reveal that our framework achieves impressive performance, specifically a 2.44% accuracy improvement over the SOTA method.
摘要：无监督的域适应性利用了来自各种源域的大量标记数据，以推广到未标记的目标数据上。先前的研究主要集中在整个源和目标域的学习域不变特征上。但是，这些方法通常需要使用源域数据训练模型，该模型耗时，可以限制具有不同源数据的应用程序的模型使用情况。本文介绍了一个简单的框架，该框架利用大型语言模型（LLMS）的令人印象深刻的概括能力进行目标数据注释而无需源模型培训，然后是一种新颖的基于相似性的知识蒸馏损失。我们对跨域文本分类的广泛实验表明，与SOTA方法相比，我们的框架具有令人印象深刻的性能，特别是2.44 \％的精度提高。

Title: Coreference as an indicator of context scope in multimodal narrative

Authors: Nikolai Ilinykh, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05298
Pdf URL: https://arxiv.org/pdf/2503.05298
Copy Paste: [[2503.05298]] Coreference as an indicator of context scope in multimodal narrative(https://arxiv.org/abs/2503.05298)
Keywords: language model
Abstract: We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality.
摘要：我们证明，在视觉讲故事的任务中，大型多模式模型在核心表达式的分布中与人类有很大差异。我们介绍了许多指标，以量化人类和机器写入文本中核心模式的特征。人类以保持文本和图像之间保持一致性的方式分发核心表达式，以高度多样化的方式讲述了对不同实体的引用。尽管可以看到发电质量的改善，但机器仍无法跟踪混合参考。

Title: Uncertainty-Aware Decoding with Minimum Bayes Risk

Authors: Nico Daheim, Clara Meister, Thomas Möllenhoff, Iryna Gurevych
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05318
Pdf URL: https://arxiv.org/pdf/2503.05318
Copy Paste: [[2503.05318]] Uncertainty-Aware Decoding with Minimum Bayes Risk(https://arxiv.org/abs/2503.05318)
Keywords: language model
Abstract: Despite their outstanding performance in the majority of scenarios, contemporary language models still occasionally generate undesirable outputs, for example, hallucinated text. While such behaviors have previously been linked to uncertainty, there is a notable lack of methods that actively consider uncertainty during text generation. In this work, we show how Minimum Bayes Risk (MBR) decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method. In short, we account for model uncertainty during decoding by incorporating a posterior over model parameters into MBR's computation of expected risk. We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead. We benchmark different methods for learning posteriors and show that performance improves with prediction diversity. We release our code publicly.
摘要：尽管在大多数情况下它们的表现出色，但现代语言模型仍然偶尔会产生不良的输出，例如幻觉。尽管此类行为以前与不确定性有关，但显然缺乏在文本生成过程中积极考虑不确定性的方法。在这项工作中，我们展示了如何根据预期风险选择模型世代的最低贝叶斯风险（MBR）解码，可以将其推广到有原则的不确定性 - 意识到的解码方法中。简而言之，我们通过将模型参数纳入MBR的预期风险计算中来解释解码过程中的模型不确定性。我们表明，这种修改后的预期风险对于选择产出和决定何时弃用并可以提供改进而不会产生开销而有用。我们基于学习后期的不同方法，并表明绩效通过预测多样性提高。我们公开发布代码。
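
The decoding rule described in the abstract above can be sketched in a few lines. Standard MBR picks the candidate with the highest expected utility against sampled outputs; the uncertainty-aware variant additionally averages that expectation over several models, here treated as draws from a posterior over parameters. This is a minimal sketch under stated assumptions: the unigram-F1 utility, the pooling of all samples as candidates, and the names `unigram_f1`, `expected_utility`, and `mbr_decode` are illustrative choices, not the authors' code.

```python
def unigram_f1(hyp, ref):
    """Token-overlap F1 between two strings; a simple stand-in utility."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    common = sum(min(h.count(w), r.count(w)) for w in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def expected_utility(candidate, samples_per_model, utility=unigram_f1):
    """Average utility over each model's samples, then over models."""
    per_model = [sum(utility(candidate, s) for s in samples) / len(samples)
                 for samples in samples_per_model]
    return sum(per_model) / len(per_model)

def mbr_decode(samples_per_model, utility=unigram_f1):
    """Return the pooled sample with the highest expected utility."""
    candidates = [s for samples in samples_per_model for s in samples]
    return max(candidates,
               key=lambda c: expected_utility(c, samples_per_model, utility))

# Each inner list holds outputs sampled from one (hypothetical) posterior draw.
samples_per_model = [
    ["the cat sat", "the cat sat down"],
    ["the cat sat", "a dog ran"],
]
print(mbr_decode(samples_per_model))  # the cat sat
```

The same expected-utility score could also drive abstention, for example by declining to answer when the best score falls below a chosen threshold; the threshold itself would be a further assumption.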

Title: Fine-Grained Evaluation for Implicit Discourse Relation Recognition

Authors: Xinyi Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05326
Pdf URL: https://arxiv.org/pdf/2503.05326
Copy Paste: [[2503.05326]] Fine-Grained Evaluation for Implicit Discourse Relation Recognition(https://arxiv.org/abs/2503.05326)
Keywords: language model
Abstract: Implicit discourse relation recognition is a challenging task in discourse analysis due to the absence of explicit discourse connectives between spans of text. Recent pre-trained language models have achieved great success on this task. However, there is no fine-grained analysis of the performance of these pre-trained language models on it. Therefore, the difficulty and possible directions of this task are unclear. In this paper, we analyze the model predictions in depth, attempting to identify what makes the task difficult for pre-trained language models and the possible directions for this task. In addition to this in-depth analysis with pre-trained language models, we semi-manually annotate data to add relatively high-quality examples for the relations with few annotated examples in PDTB 3.0. The annotated data significantly improve implicit discourse relation recognition for level-2 senses.
摘要：隐含的话语关系识别是在话语分析中的一项挑战性任务，因为文本跨度之间没有明确的话语连接。最近训练的语言模型在这项任务上取得了巨大成功。但是，对于此任务的这些预训练的语言模型的性能没有细粒度的分析。因此，此任务的困难和可能的方向尚不清楚。在本文中，我们深入分析了模型预测，试图找出预先训练的语言模型的困难以及该任务的可能方向。除了通过使用预训练的语言模型对该任务进行深入分析外，我们还会对数据进行半手册注释，以在PDTB 3.0中使用很少的带注释的示例添加相对较高的关系。带注释的数据大大有助于改善对2级感官的隐式话语关系识别。

Title: Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models

Authors: Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05328
Pdf URL: https://arxiv.org/pdf/2503.05328
Copy Paste: [[2503.05328]] Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models(https://arxiv.org/abs/2503.05328)
Keywords: language model, llm
Abstract: This paper investigates the role of dynamic external knowledge integration in improving counter-argument generation using Large Language Models (LLMs). While LLMs have shown promise in argumentative tasks, their tendency to generate lengthy, potentially unfactual responses highlights the need for more controlled and evidence-based approaches. We introduce a new manually curated dataset of argument and counter-argument pairs specifically designed to balance argumentative complexity with evaluative feasibility. We also propose a new LLM-as-a-Judge evaluation methodology that shows a stronger correlation with human judgments compared to traditional reference-based metrics. Our experimental results demonstrate that integrating dynamic external knowledge from the web significantly improves the quality of generated counter-arguments, particularly in terms of relatedness, persuasiveness, and factuality. The findings suggest that combining LLMs with real-time external knowledge retrieval offers a promising direction for developing more effective and reliable counter-argumentation systems.
摘要：本文研究了动态外部知识整合在使用大语言模型（LLM）改善反题为生成中的作用。尽管LLM在有争议的任务中表现出了希望，但它们产生冗长的，潜在的不切实际反应的趋势突出了对更受控和基于证据的方法的需求。我们介绍了一个新的手动策划的参数数据集和专为平衡论证复杂性与评估可行性平衡的反题词对。我们还提出了一种新的LLM-AS-A-法官评估方法，该方法与传统的基于参考的指标相比，与人类判断的相关性更强。我们的实验结果表明，从网络中综合动态外部知识可显着提高产生的反论点的质量，尤其是在相关性，说服力和事实方面。研究结果表明，将LLM与实时外部知识检索相结合，为开发更有效和可靠的反辩论系统提供了有希望的方向。

Title: AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications

Authors: Leming Shen, Qiang Yang, Yuanqing Zheng, Mo Li
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2503.05346
Pdf URL: https://arxiv.org/pdf/2503.05346
Copy Paste: [[2503.05346]] AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications(https://arxiv.org/abs/2503.05346)
Keywords: language model, llm
Abstract: The advent of Large Language Models (LLMs) has profoundly transformed our lives, revolutionizing interactions with AI and lowering the barrier to AI usage. While LLMs are primarily designed for natural language interaction, the extensive embedded knowledge empowers them to comprehend digital sensor data. This capability enables LLMs to engage with the physical world through IoT sensors and actuators, performing a myriad of AIoT tasks. Consequently, this evolution triggers a paradigm shift in conventional AIoT application development, democratizing its accessibility to all by facilitating the design and development of AIoT applications via natural language. However, some limitations need to be addressed to unlock the full potential of LLMs in AIoT application development. First, existing solutions often require transferring raw sensor data to LLM servers, which raises privacy concerns, incurs high query fees, and is limited by token size. Moreover, the reasoning processes of LLMs are opaque to users, making it difficult to verify the robustness and correctness of inference results. This paper introduces AutoIOT, an LLM-based automated program generator for AIoT applications. AutoIOT enables users to specify their requirements using natural language (input) and automatically synthesizes interpretable programs with documentation (output). AutoIOT automates the iterative optimization to enhance the quality of generated code with minimum user involvement. AutoIOT not only makes the execution of AIoT tasks more explainable but also mitigates privacy concerns and reduces token costs with local execution of synthesized programs. Extensive experiments and user studies demonstrate AutoIOT's remarkable capability in program synthesis for various AIoT tasks. The synthesized programs can match and even outperform some representative baselines.
摘要：大型语言模型（LLM）的出现深刻地改变了我们的生活，彻底改变了与AI的互动，并降低了AI使用的障碍。尽管LLM主要是为自然语言互动而设计的，但广泛的嵌入式知识使它们能够理解数字传感器数据。这种能力使LLM能够通过IoT传感器和执行器与物理世界互动，从而执行无数的AIOT任务。因此，这种演变触发了传统AIOT应用程序开发的范式转变，通过促进通过自然语言促进Aiot应用的设计和开发来使其对所有人的可访问性民主化。但是，需要解决一些局限性，以释放LLM在AIOT应用程序开发中的全部潜力。首先，现有的解决方案通常需要将原始传感器数据传输到LLM服务器，LLM服务器提出了隐私问题，会产生高查询费用，并且受令牌大小的限制。此外，LLM的推理过程对用户来说是不透明的，因此很难验证推理结果的鲁棒性和正确性。本文介绍了Autoiot，这是一种基于LLM的自动化程序生成器，用于AIOT应用程序。 Autoiot使用户能够使用自然语言（输入）指定其要求，并自动将可解释的程序与文档（输出）合成。 Autoiot自动化迭代优化，以最少的用户参与来增强生成的代码的质量。 Autoiot不仅可以使AIOT任务的执行更具解释性，而且还可以通过本地执行合成程序来减轻隐私问题并降低令牌成本。广泛的实验和用户研究表明，Autoiot在各种AIOT任务的程序合成方面具有出色的能力。合成的程序可以匹配甚至优于某些代表性基线。

Title: GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation

Authors: Zhenxuan Zhang, Kinhei Lee, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C Marshall, Yingying Fang, Guang Yang
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2503.05347
Pdf URL: https://arxiv.org/pdf/2503.05347
Copy Paste: [[2503.05347]] GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation(https://arxiv.org/abs/2503.05347)
Keywords: language model, llm, agent
Abstract: Automatic medical report generation supports clinical diagnosis, reduces the workload of radiologists, and holds the promise of improving diagnosis consistency. However, existing evaluation metrics primarily assess the accuracy of key medical information coverage in generated reports compared to human-written reports, while overlooking crucial details such as the location and certainty of reported abnormalities. These limitations hinder the comprehensive assessment of the reliability of generated reports and pose risks in their selection for clinical use. Therefore, we propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper, which conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Our GEMA-Score parses structured reports and employs NER-F1 calculations through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset, demonstrating its effectiveness in clinical scoring (Kendall coefficient = 0.70 for Rexval dataset and Kendall coefficient = 0.54 for RadEvalX dataset). The anonymous project demo is available at: this https URL.
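The NER-F1 component mentioned in the abstract can be illustrated with a minimal sketch. It assumes each report has already been parsed into (finding, location, severity) tuples; that tuple layout and the function name `entity_f1` are hypothetical and not taken from the paper, whose agents may match entities differently.

```python
def entity_f1(predicted, reference):
    """Entity-level F1 between two collections of extracted entity tuples.

    Assumption: entities match only when every field is identical.
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # both reports empty: treat as perfect agreement
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical parsed reports: (finding, location, severity)
generated = [("effusion", "left", "mild"), ("pneumothorax", "right", "small")]
written = [("effusion", "left", "mild"), ("cardiomegaly", "", "moderate")]
score = entity_f1(generated, written)
```

Exact tuple matching is the strictest choice; a per-field score (diagnosis, location, severity, uncertainty separately) would give partial credit and is closer to the granular evaluation the abstract describes.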

Title: Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter

Authors: Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05362
Pdf URL: https://arxiv.org/pdf/2503.05362
Copy Paste: [[2503.05362]] Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter(https://arxiv.org/abs/2503.05362)
Keywords: language model, llm
Abstract: The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it rigidly trains models on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these limitations, we propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes strategy selection preferences at each dialogue turn. We first leverage Monte Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, enabling LLMs to generate more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, highlighting the efficacy of fine-grained, turn-level preference modeling in ESC.

Title: An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning

Authors: Navdeep Kaur, Lachlan McPheat, Alessandra Russo, Anthony G Cohn, Pranava Madhyastha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05439
Pdf URL: https://arxiv.org/pdf/2503.05439
Copy Paste: [[2503.05439]] An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning(https://arxiv.org/abs/2503.05439)
Keywords: language model, llm
Abstract: In this paper, we examine the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP) to enhance the performance of standard open-weight LLMs on complex multi-step reasoning tasks. Using the StepGame dataset, which requires spatial reasoning, we apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs. Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods, achieving substantial accuracy improvements across different levels of reasoning complexity. Additionally, the LLM-as-Judge metric enhances CLM's performance, especially in assessing structurally and logically correct ASP outputs. However, calibrating CLM with diverse calibration sets did not improve generalizability for tasks requiring much longer reasoning steps, indicating limitations in handling more complex tasks.

Title: Statistical Guarantees of Correctness Coverage for Medical Multiple-Choice Question Answering

Authors: Yusong Ke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05505
Pdf URL: https://arxiv.org/pdf/2503.05505
Copy Paste: [[2503.05505]] Statistical Guarantees of Correctness Coverage for Medical Multiple-Choice Question Answering(https://arxiv.org/abs/2503.05505)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are increasingly deployed in real-world question-answering (QA) applications. However, LLMs have been shown to generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal prediction (CP) is well known to be model-agnostic and distribution-free, and it creates statistically rigorous prediction sets in classification tasks. In this work, we adapt the CP framework to medical multiple-choice question-answering (MCQA) tasks for the first time, by correlating the nonconformity score with the frequency score of correct options grounded in self-consistency theory, assuming no access to internal model information. Because the adapted CP framework can only control the (mis)coverage rate, we employ a risk control framework, which can manage task-specific metrics by devising a monotonically decreasing loss function. We evaluate our framework on 3 popular medical MCQA datasets using 4 off-the-shelf LLMs. Empirical results demonstrate that we achieve user-specified average (or marginal) error rates on the test set. Furthermore, we observe that the average prediction set size (APSS) on the test set decreases as the risk level increases, which suggests that APSS is a promising evaluation metric for the uncertainty of LLMs.
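A frequency-based conformal procedure of the kind the abstract describes can be sketched as follows. This is a generic split conformal prediction sketch, not the paper's implementation: the nonconformity score (one minus the frequency of an option among sampled answers), the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def option_frequencies(samples, options):
    """Fraction of sampled answers that picked each option."""
    counts = Counter(samples)
    return {o: counts[o] / len(samples) for o in options}


def calibrate_threshold(cal_samples, cal_answers, options, alpha):
    """Split-conformal threshold from calibration questions.

    Assumed nonconformity score: 1 - frequency of the correct option.
    """
    scores = sorted(
        1.0 - option_frequencies(s, options)[a]
        for s, a in zip(cal_samples, cal_answers)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: keep every option
    return scores[k - 1]


def prediction_set(samples, options, threshold):
    """Options whose nonconformity score does not exceed the threshold."""
    freqs = option_frequencies(samples, options)
    return [o for o in options if 1.0 - freqs[o] <= threshold]


options = ["A", "B", "C", "D"]
# Toy calibration data: 5 sampled answers per question (hypothetical).
cal_samples = [list("AAAAB"), list("BBBCC"), list("CCCCC"), list("DDAAB")]
cal_answers = ["A", "B", "C", "D"]
threshold = calibrate_threshold(cal_samples, cal_answers, options, alpha=0.25)
test_set = prediction_set(list("AABBC"), options, threshold)
```

Sampling answers and counting them needs only black-box access to the model, which matches the abstract's assumption of no access to internal model information. A larger `alpha` tolerates more errors and tends to yield a smaller threshold and smaller prediction sets.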

Title: Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data

Authors: Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05587
Pdf URL: https://arxiv.org/pdf/2503.05587
Copy Paste: [[2503.05587]] Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data(https://arxiv.org/abs/2503.05587)
Keywords: language model, llm
Abstract: Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks spurious features (a.k.a. implicit noise). While previous works have explored spurious features in LLMs, they are limited to specific features (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we statistically confirm the presence of spurious features in the RAG paradigm, a robustness problem caused by the sensitivity of LLMs to semantic-agnostic features. Moreover, we provide a comprehensive taxonomy of spurious features and empirically quantify their impact through controlled experiments. Further analysis reveals that not all spurious features are harmful and they can even be beneficial sometimes. Extensive evaluation results across multiple LLMs suggest that spurious features are a widespread and challenging problem in the field of RAG. To facilitate future research, we release all code and data at: this https URL.

Title: AceWGS: An LLM-Aided Framework to Accelerate Catalyst Design for Water-Gas Shift Reactions

Authors: Joyjit Chattoraj, Brahim Hamadicharef, Teo Shi Chang, Yingzhi Zeng, Chee Kok Poh, Luwei Chen, Teck Leong Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05607
Pdf URL: https://arxiv.org/pdf/2503.05607
Copy Paste: [[2503.05607]] AceWGS: An LLM-Aided Framework to Accelerate Catalyst Design for Water-Gas Shift Reactions(https://arxiv.org/abs/2503.05607)
Keywords: language model, llm
Abstract: While the Water-Gas Shift (WGS) reaction plays a crucial role in hydrogen production for fuel cells, finding suitable catalysts to achieve high yields for low-temperature WGS reactions remains a persistent challenge. Artificial Intelligence (AI) has shown promise in accelerating catalyst design by exploring vast candidate spaces; however, two key gaps limit its effectiveness. First, AI models primarily train on numerical data, which fail to capture essential text-based information, such as catalyst synthesis methods. Second, the cross-disciplinary nature of catalyst design requires seamless collaboration between AI, theory, experiments, and numerical simulations, often leading to communication barriers. To address these gaps, we present AceWGS, a Large Language Model (LLM)-aided framework to streamline WGS catalyst design. AceWGS interacts with researchers through natural language and offers four features: (i) answering general queries, (ii) extracting information about the database comprising WGS-related journal articles, (iii) comprehending the context described in these articles, and (iv) identifying catalyst candidates using our proposed AI inverse model. We present a practical case study demonstrating how AceWGS can accelerate the catalyst design process. AceWGS, built with open-source tools, offers an adjustable framework that researchers can readily adapt for a range of AI-accelerated catalyst design applications, supporting seamless integration across cross-disciplinary studies.

Title: Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings

Authors: Xuanqing Liu, Luyang Kong, Wei Niu, Afshin Khashei, Belinda Zeng, Steve Johnson, Jon Jay, Davor Golac, Matt Pope
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05620
Pdf URL: https://arxiv.org/pdf/2503.05620
Copy Paste: [[2503.05620]] Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings(https://arxiv.org/abs/2503.05620)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning. However, analyzing live dialogues in real time requires low-latency processing, which makes it impractical to deploy models with billions of parameters. As a result, practitioners often prefer smaller models with millions of parameters, trained on high-quality, human-annotated datasets. Yet, curating such datasets is both time-consuming and costly. Consequently, there is a growing need to combine the scalability of LLM-generated labels with the precision of human annotations, enabling fine-tuned smaller models to achieve both higher speed and accuracy comparable to larger models. In this paper, we introduce a simple yet effective framework to address this challenge. Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more. To mitigate the impact of labeling errors from LLMs -- the primary source of inaccuracies in student models -- we propose a noise-reduced preference learning loss. Experimental results demonstrate that our method significantly improves accuracy across utterance-level dialogue tasks, including sentiment detection (over 2%), dialogue act classification (over 1.5%), etc.

Title: Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05641
Pdf URL: https://arxiv.org/pdf/2503.05641
Copy Paste: [[2503.05641]] Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning(https://arxiv.org/abs/2503.05641)
Keywords: gpt, llm, agent
Abstract: Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch inference strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that Symbolic-MoE outperforms strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute average improvement of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.
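The batch inference strategy in the abstract (group instances by their assigned experts so that each model is loaded only once) can be sketched as below. The `load_model` callable, the expert names, and the toy assignments are hypothetical stand-ins; this is an illustration of the grouping idea under those assumptions, not the authors' code.

```python
from collections import defaultdict


def group_by_expert(assignments):
    """Invert {instance_id: [expert, ...]} into {expert: [instance_id, ...]}."""
    groups = defaultdict(list)
    for instance_id, experts in assignments.items():
        for expert in experts:
            groups[expert].append(instance_id)
    return dict(groups)


def run_batched(assignments, load_model):
    """Load each expert once and run it on every instance assigned to it."""
    outputs = defaultdict(list)
    for expert, instance_ids in group_by_expert(assignments).items():
        model = load_model(expert)  # one load per expert, not per instance
        for instance_id in instance_ids:
            outputs[instance_id].append(model(instance_id))
    return dict(outputs)


# Toy setup: a fake loader that records how often each expert is loaded.
load_log = []


def fake_load_model(name):
    load_log.append(name)
    return lambda instance_id: f"{name}:{instance_id}"


assignments = {"q1": ["math", "bio"], "q2": ["math"], "q3": ["bio", "chem"]}
results = run_batched(assignments, fake_load_model)
```

In this toy case a naive per-instance loop would load a model five times (once per instance-expert pair), while the grouped loop loads three. The per-instance outputs would then go to an aggregator, which this sketch leaves out.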

Title: Understanding the Limits of Lifelong Knowledge Editing in LLMs

Authors: Lukas Thede, Karsten Roth, Matthias Bethge, Zeynep Akata, Tom Hartvigsen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05683
Pdf URL: https://arxiv.org/pdf/2503.05683
Copy Paste: [[2503.05683]] Understanding the Limits of Lifelong Knowledge Editing in LLMs(https://arxiv.org/abs/2503.05683)
Keywords: language model, llm
Abstract: Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods are only tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge the gap between research on lifelong knowledge editing and real-world edits at a practically relevant scale. We first introduce WikiBigEdit, a large-scale benchmark of real-world Wikidata edits, built to be extended automatically over time for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use WikiBigEdit to study the ability of existing knowledge editing techniques to incorporate large volumes of real-world facts, and we contrast their capabilities with generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.