2025-04-18

Title: Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability

Authors: Devansh Singh, Sundaraparipurnan Narayanan
Subjects: cs.CL, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2504.12308
Pdf URL: https://arxiv.org/pdf/2504.12308
Copy Paste: [[2504.12308]] Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability(https://arxiv.org/abs/2504.12308)
Keywords: language model
Abstract: Privacy masking is a critical concept in data privacy, involving anonymization and de-anonymization of personally identifiable information (PII). Privacy masking techniques rely on Named Entity Recognition (NER) approaches from NLP to identify and classify named entities in text. NER approaches, however, have several limitations, including (a) content sensitivity, covering ambiguous, polysemic, context-dependent, or domain-specific content; (b) phrasing variability, including nicknames and aliases, informal expressions, alternative representations, emerging expressions, and evolving naming conventions; and (c) format or syntax variations, typos, and misspellings. Nevertheless, a couple of PII datasets have been widely used by researchers and the open-source community to train models for PII detection or masking. These datasets have been used to train models including Piiranha and Starpii, which have been downloaded over 300k and 580k times, respectively, on HuggingFace. We examine the quality of the PII masking by these models given the limitations of the datasets and of the NER approaches. We curate a dataset of 17K unique, semi-synthetic sentences containing 16 types of PII by compiling information from multiple jurisdictions, including India, the U.K., and the U.S. We generate sentences (using language models) containing these PII along five NER detection feature dimensions, namely (1) Basic Entity Recognition, (2) Contextual Entity Disambiguation, (3) NER in Noisy & Real-World Data, (4) Evolving & Novel Entities Detection, and (5) Cross-Lingual or Multilingual NER, plus one adversarial context. We present the results and exhibit the privacy exposure caused by such model use (considering the extent of lifetime downloads of these models). We conclude by highlighting the gaps in measuring performance of the models and the need for contextual disclosure in model cards for such models.

Title: Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

Authors: Enming Zhang, Liwen Cao, Yanru Wu, Zijie Zhao, Guan Wang, Yang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12311
Pdf URL: https://arxiv.org/pdf/2504.12311
Copy Paste: [[2504.12311]] Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer(https://arxiv.org/abs/2504.12311)
Keywords: prompt
Abstract: Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly in resource-constrained systems. As pre-trained prompts have become valuable intellectual assets, combining multiple source prompts offers a promising approach to enhance generalization to new tasks by leveraging complementary knowledge from diverse sources. However, naive aggregation of these prompts often leads to representation collapse due to mutual interference, undermining their collective potential. To address these challenges, we propose HGPrompt, an adaptive framework for multi-source prompt transfer that learns optimal ensemble weights by jointly optimizing dual objectives: transferability and stability. Specifically, we first introduce an information-theoretic metric to evaluate the transferability of prompt-induced features on the target task, capturing the intrinsic alignment between the feature representations. Additionally, we propose a novel Gradient Alignment Regularization to mitigate gradient conflicts among prompts, enabling stable and coherent knowledge transfer from multiple sources while suppressing interference. Extensive experiments on the large-scale VTAB benchmark demonstrate that HGPrompt achieves state-of-the-art performance, validating its effectiveness in multi-source prompt transfer.

Title: Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Authors: Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, Yuekang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12312
Pdf URL: https://arxiv.org/pdf/2504.12312
Copy Paste: [[2504.12312]] Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles(https://arxiv.org/abs/2504.12312)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

Title: Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models

Authors: Xiaoyan Zhao, Yang Deng, Wenjie Wang, Hongzhan Lin, Hong Cheng, Rui Zhang, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12313
Pdf URL: https://arxiv.org/pdf/2504.12313
Copy Paste: [[2504.12313]] Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models(https://arxiv.org/abs/2504.12313)
Keywords: language model, llm, prompt, agent
Abstract: Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.

Title: How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

Authors: Hao Li, Liuzhenghao Lv, He Cao, Zijing Liu, Zhiyuan Yan, Yu Wang, Yonghong Tian, Yu Li, Li Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12314
Pdf URL: https://arxiv.org/pdf/2504.12314
Copy Paste: [[2504.12314]] How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension(https://arxiv.org/abs/2504.12314)
Keywords: language model, llm, hallucination
Abstract: Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Utilizing the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, the Hallucination Reduction Post-processing (HRPP) stage is proposed to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.

Title: Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

Authors: Xingguang Ji, Jiakang Wang, Hongzhi Zhang, Jingyuan Zhang, Haonan Zhou, Chenxi Sun, Yahui Liu, Qi Wang, Fuzheng Zhang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.12315
Pdf URL: https://arxiv.org/pdf/2504.12315
Copy Paste: [[2504.12315]] Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models(https://arxiv.org/abs/2504.12315)
Keywords: language model, llm, chat
Abstract: With the development of Multimodal Large Language Models (MLLMs), numerous outstanding accomplishments have emerged within the open-source community. Due to the complexity of creating and training on multimodal data pairs, building powerful MLLMs is still a computationally expensive and time-consuming process. In this work, we introduce Capybara-OMNI, an MLLM that trains in a lightweight and efficient manner and supports understanding text, image, video, and audio modalities. We present in detail the framework design, the data construction, and the training recipe, to develop an MLLM step-by-step to obtain competitive performance. We also provide exclusive benchmarks utilized in our experiments to show how to properly verify understanding capabilities across different modalities. Results show that by following our guidance, we can efficiently build an MLLM that achieves competitive performance among models of the same scale on various multimodal benchmarks. Additionally, to enhance the multimodal instruction-following and conversational capabilities of the model, we further discuss how to train the chat version upon an MLLM understanding model, which is more in line with user habits for tasks like real-time interaction with humans. We publicly disclose the Capybara-OMNI model, along with its chat-based version. The disclosure includes the model weights, a portion of the training data, and the inference code, which are made available on GitHub.

Title: Data Metabolism: An Efficient Data Design Schema For Vision Language Model

Authors: Jingyuan Zhang, Hongzhi Zhang, Zhou Haonan, Chenxi Sun, Xingguang Ji, Jiakang Wang, Fanheng Kong, Yahui Liu, Qi Wang, Fuzheng Zhang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.12316
Pdf URL: https://arxiv.org/pdf/2504.12316
Copy Paste: [[2504.12316]] Data Metabolism: An Efficient Data Design Schema For Vision Language Model(https://arxiv.org/abs/2504.12316)
Keywords: language model
Abstract: Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework to build VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger in size. Moreover, it achieves results that are on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.

Title: ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing

Authors: Dingkang Lin, Naixuan Zhao, Dan Tian, Jiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12317
Pdf URL: https://arxiv.org/pdf/2504.12317
Copy Paste: [[2504.12317]] ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing(https://arxiv.org/abs/2504.12317)
Keywords: gpt, llm, chat
Abstract: The advent of ChatGPT has profoundly reshaped scientific research practices, particularly in academic writing, where non-native English speakers (NNES) have historically faced linguistic barriers. This study investigates whether ChatGPT mitigates these barriers and fosters equity by analyzing lexical complexity shifts across 2.8 million articles from OpenAlex (2020-2024). Using the Measure of Textual Lexical Diversity (MTLD) to quantify vocabulary sophistication and a difference-in-differences (DID) design to identify causal effects, we demonstrate that ChatGPT significantly enhances lexical complexity in NNES-authored abstracts, even after controlling for article-level characteristics, authorship patterns, and venue norms. Notably, the impact is most pronounced in preprint papers, technology- and biology-related fields, and lower-tier journals. These findings provide causal evidence that ChatGPT reduces linguistic disparities and promotes equity in global academia.
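The difference-in-differences (DID) design named in the abstract above can be illustrated with a minimal two-group, two-period sketch. The MTLD scores below are made-up numbers, and the function and variable names are hypothetical; the paper's actual specification (with its controls for authorship patterns and venue norms) is not reproduced here.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Return the basic 2x2 difference-in-differences estimate.

    The estimate is the change in the treated group's mean minus the
    change in the control group's mean over the same period.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical MTLD scores (illustrative only, not data from the paper).
nnes_pre = [60.0, 62.0, 64.0]      # NNES-authored abstracts, before ChatGPT
nnes_post = [70.0, 72.0, 74.0]     # NNES-authored abstracts, after ChatGPT
native_pre = [80.0, 82.0, 84.0]    # comparison group, before
native_post = [83.0, 85.0, 87.0]   # comparison group, after

effect = did_estimate(nnes_pre, nnes_post, native_pre, native_post)
print(f"DID estimate of the lexical-complexity shift: {effect:.1f}")
```

In this toy setup the NNES group gains 10 points and the comparison group gains 3, so the estimate attributes 7 points to the treatment, under the usual parallel-trends assumption.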

Title: Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability

Authors: Jennifer Haase, Paul H. P. Hanel, Sebastian Pokutta
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.12320
Pdf URL: https://arxiv.org/pdf/2504.12320
Copy Paste: [[2504.12320]] Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability(https://arxiv.org/abs/2504.12320)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs -- including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek -- across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.

Title: AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks

Authors: Charlotte Siska, Anush Sankaran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12321
Pdf URL: https://arxiv.org/pdf/2504.12321
Copy Paste: [[2504.12321]] AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks(https://arxiv.org/abs/2504.12321)
Keywords: language model, gpt, llm, prompt, agent
Abstract: In the past few years, Language Models (LMs) have shown par-human capabilities in several domains. Despite their practical applications and exceeding user consumption, they are susceptible to jailbreaks when malicious input exploits the LM's weaknesses, causing it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. However, it is challenging to explain the reason behind the malicious nature of the jailbreak, which results in a wide variety of closed-box approaches. In this research, we propose and demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts, providing a novel, explainable, and cheaper defense approach called AttentionDefense. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input that is not captured in the semantic meaning of text embeddings. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets. Ablation studies show that SLM-based AttentionDefense has equivalent or better jailbreak detection performance compared to text embedding-based classifiers and GPT-4 zero-shot detectors. To further validate the efficacy of the proposed approach, we generate a dataset of novel jailbreak variants of the existing benchmark dataset using a closed-loop LLM-based multi-agent system. We demonstrate that the proposed AttentionDefense approach performs robustly on this novel jailbreak dataset while existing approaches suffer in performance. Additionally, for practical purposes AttentionDefense is an ideal solution as it has the computation requirements of a small LM but the performance of an LLM detector.

Title: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

Authors: Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Conghui He, Lijun Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12322
Pdf URL: https://arxiv.org/pdf/2504.12322
Copy Paste: [[2504.12322]] A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis(https://arxiv.org/abs/2504.12322)
Keywords: language model, llm, agent
Abstract: While data synthesis and distillation are promising strategies to enhance small language models, current approaches rely heavily on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across small LLMs to achieve the iterative refinement and quality control typically achieved by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at this https URL.
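The Generator, Reviewer, and Adjudicator roles described in the abstract above can be sketched as a small control loop. This is only an illustration under stated assumptions: the three roles are plain Python callables standing in for small LLMs, the accept/reject voting rule is hypothetical, and every name below (`synthesize`, `toy_generator`, and so on) is invented for this sketch rather than taken from the GRA code.

```python
from typing import Callable, List, Optional

# Hypothetical role signatures; in GRA each role would be a small LLM.
Generator = Callable[[str], str]
Reviewer = Callable[[str], bool]
Adjudicator = Callable[[str, List[bool]], bool]


def synthesize(seed: str, generator: Generator, reviewers: List[Reviewer],
               adjudicator: Adjudicator) -> Optional[str]:
    """Run one generate-review-adjudicate round; return the sample or None."""
    sample = generator(seed)                        # Generator proposes a sample
    votes = [review(sample) for review in reviewers]  # Reviewers critique it
    if all(votes):
        return sample                               # unanimous accept
    if not any(votes):
        return None                                 # unanimous reject
    # Reviewers disagree: the Adjudicator resolves the conflict.
    return sample if adjudicator(sample, votes) else None


# Toy stand-ins for the three roles (assumptions, not real models).
def toy_generator(seed: str) -> str:
    return f"Explain {seed} in one sentence."


def lenient_reviewer(sample: str) -> bool:
    return len(sample) > 10


def strict_reviewer(sample: str) -> bool:
    return sample.endswith("?")


def majority_adjudicator(sample: str, votes: List[bool]) -> bool:
    return sum(votes) * 2 >= len(votes)


result = synthesize("recursion", toy_generator,
                    [lenient_reviewer, strict_reviewer], majority_adjudicator)
print(result)
```

Keeping the roles as interchangeable callables mirrors the decomposition into specialized sub-tasks: any one role can be swapped for a different model without touching the control flow.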

Title: The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation

Authors: Zheng Zhang, Ning Li, Qi Liu, Rui Li, Weibo Gao, Qingyang Mao, Zhenya Huang, Baosheng Yu, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12323
Pdf URL: https://arxiv.org/pdf/2504.12323
Copy Paste: [[2504.12323]] The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation(https://arxiv.org/abs/2504.12323)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. By referencing this external knowledge, RAG effectively reduces the generation of factually incorrect content and addresses hallucination issues within LLMs. Recently, there has been growing attention to improving the performance and efficiency of RAG systems from various perspectives. While these advancements have yielded significant results, the application of RAG in domains with considerable societal implications raises a critical question about fairness: What impact does the introduction of the RAG paradigm have on the fairness of LLMs? To address this question, we conduct extensive experiments by varying the LLMs, retrievers, and retrieval sources. Our experimental analysis reveals that the scale of the LLMs plays a significant role in influencing fairness outcomes within the RAG framework. When the model scale is smaller than 8B, the integration of retrieval mechanisms often exacerbates unfairness in small-scale LLMs (e.g., LLaMA3.2-1B, Mistral-7B, and LLaMA3-8B). To mitigate the fairness issues introduced by RAG for small-scale LLMs, we propose two approaches, FairFT and FairFilter. Specifically, in FairFT, we align the retriever with the LLM in terms of fairness, enabling it to retrieve documents that facilitate fairer model outputs. In FairFilter, we propose a fairness filtering mechanism to filter out biased content after retrieval. Finally, we validate our proposed approaches on real-world datasets, demonstrating their effectiveness in improving fairness while maintaining performance.

Title: Cross-Document Cross-Lingual Natural Language Inference via RST-enhanced Graph Fusion and Interpretability Prediction

Authors: Mengying Yuan, Wangzi Xuan, Fei Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12324
Pdf URL: https://arxiv.org/pdf/2504.12324
Copy Paste: [[2504.12324]] Cross-Document Cross-Lingual Natural Language Inference via RST-enhanced Graph Fusion and Interpretability Prediction(https://arxiv.org/abs/2504.12324)
Keywords: gpt, llm
Abstract: Natural Language Inference (NLI) is a fundamental task in both natural language processing and information retrieval. While NLI has developed many sub-directions such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm for CDCL-NLI that extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 1,110 instances and spanning 26 languages. To build a baseline for this task, we also propose an innovative method that integrates RST-enhanced graph fusion and interpretability prediction. Our method employs RST (Rhetorical Structure Theory) on RGAT (Relation-aware Graph Attention Network) for cross-document context modeling, coupled with a structure-aware semantic alignment mechanism based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU-level attribution framework that generates extractive explanations. Extensive experiments demonstrate our approach's superior performance, achieving significant improvements over both traditional NLI models such as DocNLI and R2F, as well as LLMs like Llama3 and GPT-4o. Our work sheds light on the study of NLI and should stimulate research interest in cross-document cross-lingual context understanding, semantic retrieval and interpretability inference. Our dataset and code are available at this https URL (CDCL-NLI-Link for peer review).

Title: LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media

Authors: Haiqi Zhang, Zhengyuan Zhu, Zeyu Zhang, Chengkai Li
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2504.12325
Pdf URL: https://arxiv.org/pdf/2504.12325
Copy Paste: [[2504.12325]] LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media(https://arxiv.org/abs/2504.12325)
Keywords: language model, gpt, llm
Abstract: With the vast expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of a taxonomy of factual claims from social media by generating topics at multiple levels of granularity. This approach aids stakeholders in more effectively navigating the social media landscape. We implement this framework with different models across three distinct datasets and introduce specially designed taxonomy evaluation metrics for a comprehensive assessment. With evaluations from both human evaluators and GPT-4, the results indicate that LLMTaxo effectively categorizes factual claims from social media and reveal that certain models perform better on specific datasets.

Title: Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

Authors: Shahriar Noroozizadeh, Jeremy C. Weiss
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12326
Pdf URL: https://arxiv.org/pdf/2504.12326
Copy Paste: [[2504.12326]] Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis(https://arxiv.org/abs/2504.12326)
Keywords: language model, llm
Abstract: Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the Pubmed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from I2B2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: O1-preview 0.755, Llama 3.3 70B Instruct 0.753) and strong temporal ordering (concordance: O1-preview 0.932, Llama 3.3 70B Instruct 0.932). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.

Title: A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future

Authors: Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12328
Pdf URL: https://arxiv.org/pdf/2504.12328
Copy Paste: [[2504.12328]] A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future(https://arxiv.org/abs/2504.12328)
Keywords: language model, llm
Abstract: Reward models (RMs) have demonstrated impressive potential for enhancing Large Language Models (LLMs), as an RM can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub (this https URL).

Title: HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

Authors: Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12330
Pdf URL: https://arxiv.org/pdf/2504.12330
Copy Paste: [[2504.12330]] HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation(https://arxiv.org/abs/2504.12330)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation, agent
Abstract: While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at this https URL.
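As a rough illustration of the consistency voting that the HM-RAG Decision Agent is described as using, the sketch below applies a plain majority vote to answers returned by several retrieval agents. The function name `consistency_vote` and the agreement ratio are assumptions made for illustration; the abstract does not specify the actual voting rule or how Expert Model Refinement is triggered.

```python
from collections import Counter


def consistency_vote(answers):
    """Return (most common answer, share of agents that gave it).

    Hypothetical helper: a simple majority vote, not the paper's exact rule.
    Returns (None, 0.0) when no answers are supplied.
    """
    if not answers:
        return None, 0.0
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Example: answers from three hypothetical retrieval agents (vector, graph, web).
answer, agreement = consistency_vote(["B", "B", "A"])
print(answer, round(agreement, 2))
```

Under this reading, a low agreement ratio would be the natural signal for handing the discrepancy to a refinement step, though that threshold is not given in the source.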

Title: Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation

Authors: Xiangju Li, Dong Yang, Xiaogang Zhu, Faliang Huang, Peng Zhang, Zhongying Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12331
Pdf URL: https://arxiv.org/pdf/2504.12331
Copy Paste: [[2504.12331]] Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation(https://arxiv.org/abs/2504.12331)
Keywords: language model, llm, prompt
Abstract: Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause detection, these methods often confront challenges originating from redundant information retrieval and difficulty in accurately determining emotion categories, particularly when emotions are expressed implicitly or ambiguously. To overcome these challenges, this study explores a fine-grained approach to span-level emotion-cause-category triplet extraction and introduces an innovative framework that leverages instruction tuning and data augmentation techniques based on large language models. The proposed method employs task-specific triplet extraction instructions and utilizes low-rank adaptation to fine-tune large language models, eliminating the necessity for intricate task-specific architectures. Furthermore, a prompt-based data augmentation strategy is developed to address data scarcity by guiding large language models in generating high-quality synthetic training data. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms existing baseline methods, achieving at least a 12.8% improvement in span-level emotion-cause-category triplet extraction metrics. The results demonstrate the method's effectiveness and robustness, offering a promising avenue for advancing research in emotion cause analysis. The source code is available at this https URL.

Title: Can the capability of Large Language Models be described by human ability? A Meta Study

Authors: Mingrui Zan, Yunquan Zhang, Boyang Zhang, Fangming Liu, Daning Cheng
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.12332
Pdf URL: https://arxiv.org/pdf/2504.12332
Copy Paste: [[2504.12332]] Can the capability of Large Language Models be described by human ability? A Meta Study(https://arxiv.org/abs/2504.12332)
Keywords: language model, llm
Abstract: Users of Large Language Models (LLMs) often perceive these models as intelligent entities with human-like capabilities. However, the extent to which LLMs' capabilities truly approximate human abilities remains a topic of debate. In this paper, to characterize the capabilities of LLMs in relation to human capabilities, we collected performance data from over 80 models across 37 evaluation benchmarks. The evaluation benchmarks are categorized into 6 primary abilities and 11 sub-abilities of human ability. We then clustered the performance rankings into several categories and compared these clustering results with classifications based on human ability aspects. Our findings lead to the following conclusions: 1. We have confirmed that certain capabilities of LLMs with fewer than 10 billion parameters can indeed be described using human ability metrics; 2. While some abilities are considered interrelated in humans, they appear nearly uncorrelated in LLMs; 3. The capabilities possessed by LLMs vary significantly with the parameter scale of the model.

Title: Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games

Authors: Andrés Isaza-Giraldo, Paulo Bala, Lucas Pereira
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.12333
Pdf URL: https://arxiv.org/pdf/2504.12333
Copy Paste: [[2504.12333]] Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games(https://arxiv.org/abs/2504.12333)
Keywords: language model, llm
Abstract: The evaluation of open-ended responses in serious games presents a unique challenge, as correctness is often subjective. Large Language Models (LLMs) are increasingly being explored as evaluators in such contexts, yet their accuracy and consistency remain uncertain, particularly for smaller models intended for local execution. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. By leveraging traditional binary classification metrics (including accuracy, true positive rate, and true negative rate), we systematically compare these models across different evaluation scenarios. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance. We demonstrate that while some models excel at identifying correct responses, others struggle with false positives or inconsistent evaluations. The findings highlight the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. This work contributes to the broader discourse on the trustworthiness of AI-driven assessment tools, offering insights into how different LLM architectures handle subjective evaluation tasks.
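The binary classification metrics named in the abstract above (accuracy, true positive rate, true negative rate) can be computed as in the following minimal sketch. It assumes labels are encoded as 1 (response judged correct) and 0 (judged incorrect); the function name `binary_metrics` and the toy labels are illustrative and not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, true positive rate and true negative rate.

    y_true holds reference labels, y_pred the LLM evaluator's labels (1 or 0).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity: share of truly correct responses the evaluator accepts.
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of truly incorrect responses the evaluator rejects.
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy example: five player responses scored by a hypothetical evaluator.
print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Reporting the true positive and true negative rates separately is what exposes the sensitivity/specificity trade-off: an evaluator that accepts everything has a perfect true positive rate and a true negative rate of zero.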

Title: QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Authors: Zongxian Yang, Jiayu Qian, Zhi-An Huang, Kay Chen Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12334
Pdf URL: https://arxiv.org/pdf/2504.12334
Copy Paste: [[2504.12334]] QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model(https://arxiv.org/abs/2504.12334)
Keywords: language model, llm
Abstract: Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQA-USMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. We also propose an effective data distillation method based on ToT. Compared to the traditional distillation method, we achieved an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.

Title: You've Changed: Detecting Modification of Black-Box Large Language Models

Authors: Alden Dima, James Foulds, Shimei Pan, Philip Feldman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12335
Pdf URL: https://arxiv.org/pdf/2504.12335
Copy Paste: [[2504.12335]] You've Changed: Detecting Modification of Black-Box Large Language Models(https://arxiv.org/abs/2504.12335)
Keywords: language model, llm, prompt, chat
Abstract: Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitor LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text. Our method uses a statistical test to determine whether the distributions of features from two samples of text are equivalent, allowing developers to identify when an LLM has changed. We demonstrate the effectiveness of our approach using five OpenAI completion models and Meta's Llama 3 70B chat model. Our results show that simple text features coupled with a statistical test can distinguish between language models. We also explore the use of our approach to detect prompt injection attacks. Our work enables frequent LLM change monitoring and avoids computationally expensive benchmark evaluations.
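The idea of comparing feature distributions from two samples of generated text can be sketched as follows. This is an assumption-laden illustration, not the authors' method: the abstract does not name the features or the statistical test, so the sketch substitutes a single simple feature (mean word length, via the hypothetical `mean_word_length`) and a two-sample permutation test on the difference of means (`permutation_test`).

```python
import random


def mean_word_length(text):
    """Toy text feature: average number of characters per whitespace-split word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def permutation_test(a, b, n_perm=999, seed=0):
    """Two-sample permutation test on the absolute difference of means.

    Returns an estimated p-value; small values suggest the two samples
    come from different distributions. Both samples must be non-empty.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Usage: featurize outputs collected at two points in time, then test.
before = [mean_word_length(t) for t in ["the cat sat down", "a dog ran off"]]
after = [mean_word_length(t) for t in ["notwithstanding considerable", "extraordinarily verbose"]]
p_value = permutation_test(before, after)
```

With realistic sample sizes, a p-value below a chosen significance level would be the cue to investigate whether the served model has changed.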

Title: "It Listens Better Than My Therapist": Exploring Social Media Discourse on LLMs as Mental Health Tool

Authors: Anna-Carolina Haensch
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2504.12337
Pdf URL: https://arxiv.org/pdf/2504.12337
Copy Paste: [[2504.12337]] "It Listens Better Than My Therapist": Exploring Social Media Discourse on LLMs as Mental Health Tool(https://arxiv.org/abs/2504.12337)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The emergence of generative AI chatbots such as ChatGPT has prompted growing public and academic interest in their role as informal mental health support tools. While early rule-based systems have been around for several years, large language models (LLMs) offer new capabilities in conversational fluency, empathy simulation, and availability. This study explores how users engage with LLMs as mental health tools by analyzing over 10,000 TikTok comments from videos referencing LLMs as mental health tools. Using a self-developed tiered coding schema and supervised classification models, we identify user experiences, attitudes, and recurring themes. Results show that nearly 20% of comments reflect personal use, with these users expressing overwhelmingly positive attitudes. Commonly cited benefits include accessibility, emotional support, and perceived therapeutic value. However, concerns around privacy, generic responses, and the lack of professional oversight remain prominent. It is important to note that the user feedback does not indicate which therapeutic framework, if any, the LLM-generated output aligns with. While the findings underscore the growing relevance of AI in everyday practices, they also highlight the urgent need for clinical and ethical scrutiny in the use of AI for mental health support.

Title: Paging Dr. GPT: Extracting Information from Clinical Notes to Enhance Patient Predictions

Authors: David Anderson, Michaela Anderson, Margret Bjarnadottir, Stephen Mahar, Shriyan Reyya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12338
Pdf URL: https://arxiv.org/pdf/2504.12338
Copy Paste: [[2504.12338]] Paging Dr. GPT: Extracting Information from Clinical Notes to Enhance Patient Predictions(https://arxiv.org/abs/2504.12338)
Keywords: language model, gpt, llm, chat
Abstract: There is a long history of building predictive models in healthcare using tabular data from electronic medical records. However, these models fail to extract the information found in unstructured clinical notes, which document diagnosis, treatment, progress, medications, and care plans. In this study, we investigate how answers generated by GPT-4o-mini (ChatGPT) to simple clinical questions about patients, when given access to the patient's discharge summary, can support patient-level mortality prediction. Using data from 14,011 first-time admissions to the Coronary Care or Cardiovascular Intensive Care Units in the MIMIC-IV Note dataset, we implement a transparent framework that uses GPT responses as input features in logistic regression models. Our findings demonstrate that GPT-based models alone can outperform models trained on standard tabular data, and that combining both sources of information yields even greater predictive power, increasing AUC by an average of 5.1 percentage points and increasing positive predictive value by 29.9 percent for the highest-risk decile. These results highlight the value of integrating large language models (LLMs) into clinical prediction tasks and underscore the broader potential for using LLMs in any domain where unstructured text data remains an underutilized resource.

Title: GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

Authors: Yaodong Song, Hongjie Chen, Jie Lian, Yuxin Zhang, Guangmin Xia, Zehan Li, Genliang Zhao, Jian Kang, Yongxiang Li, Jie Li
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2504.12339
Pdf URL: https://arxiv.org/pdf/2504.12339
Copy Paste: [[2504.12339]] GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture(https://arxiv.org/abs/2504.12339)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.

Title: Streamlining Biomedical Research with Specialized LLMs

Authors: Linqing Chen, Weilei Wang, Yubin Xia, Wentao Wu, Peng Xu, Zilong Bai, Jie Fang, Chaobo Xu, Ran Hu, Licong Xu, Haoran Hua, Jing Sun, Hanmeng Zhong, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yong Gu, Tao Shi, Chaochao Wang, Jianping Lu, Cheng Sun, Yixin Wang, Shengjie Yang, Yuancheng Li, Lu Jin, Lisha Zhang, Fu Bian, Zhongkai Ye, Lidong Pei, Changyang Tu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12341
Pdf URL: https://arxiv.org/pdf/2504.12341
Copy Paste: [[2504.12341]] Streamlining Biomedical Research with Specialized LLMs(https://arxiv.org/abs/2504.12341)
Keywords: language model, llm
Abstract: In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, images, tables, and other modalities. We demonstrate the system's capability to enhance response precision by leveraging a robust question-answering model, significantly improving the quality of dialogue generation. The system provides an accessible platform for real-time, high-fidelity interactions, allowing users to benefit from efficient human-computer interaction, precise retrieval, and simultaneous access to a wide range of literature and data. This dramatically improves the research efficiency of professionals in the biomedical and pharmaceutical domains and facilitates faster, more informed decision-making throughout the R&D process. Furthermore, the system proposed in this paper is available at this https URL.

Title: Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation

Authors: Hanmeng Zhong, Linqing Chen, Weilei Wang, Wentao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12342
Pdf URL: https://arxiv.org/pdf/2504.12342
Copy Paste: [[2504.12342]] Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation(https://arxiv.org/abs/2504.12342)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recently, the application of retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, in this context, there is no benchmark specifically designed for biopharmaceuticals to evaluate LLMs. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs' Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional question-answering (QA) metrics like accuracy and exact match fall short in open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate the QRUC of LLMs, that is, their ability to understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.

Title: Propaganda via AI? A Study on Semantic Backdoors in Large Language Models

Authors: Nay Myat Min, Long H. Pham, Yige Li, Jun Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12344
Pdf URL: https://arxiv.org/pdf/2504.12344
Copy Paste: [[2504.12344]] Propaganda via AI? A Study on Semantic Backdoors in Large Language Models(https://arxiv.org/abs/2504.12344)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors, covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at this https URL.
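The notion of flagging "anomalously uniform outputs" via semantic entropy can be illustrated with a small sketch. It assumes the sampled responses have already been grouped into meaning clusters (the abstract says this is done via bidirectional entailment, which would require an NLI model and is not reproduced here). The names `semantic_entropy` and `is_anomalously_uniform`, and the threshold value, are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the distribution over meaning clusters.

    cluster_ids gives one cluster label per sampled response.
    """
    if not cluster_ids:
        return 0.0
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # p * log(1/p) summed over clusters; zero when all responses agree.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def is_anomalously_uniform(cluster_ids, threshold=0.3):
    """Flag a prompt whose sampled responses collapse into few clusters.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return semantic_entropy(cluster_ids) < threshold


# Ten samples that all fall into one meaning cluster are flagged.
print(is_anomalously_uniform(["c0"] * 10))
```

A low entropy on a contested topic is only a signal; per the abstract, comparing the same prompt across several models is what separates a model-specific anomaly from a bias shared across training corpora.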

Title: Reimagining Urban Science: Scaling Causal Inference with Large Language Models

Authors: Yutong Xia, Ao Qu, Yunhan Zheng, Yihong Tang, Dingyi Zhuang, Yuxuan Liang, Cathy Wu, Roger Zimmermann, Jinhua Zhao
Subjects: cs.CL, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2504.12345
Pdf URL: https://arxiv.org/pdf/2504.12345
Copy Paste: [[2504.12345]] Reimagining Urban Science: Scaling Causal Inference with Large Language Models(https://arxiv.org/abs/2504.12345)
Keywords: language model, llm, agent
Abstract: Urban causal research is essential for understanding the complex dynamics of cities and informing evidence-based policies. However, it is challenged by the inefficiency and bias of hypothesis generation, barriers to multimodal data complexity, and the methodological fragility of causal experimentation. Recent advances in large language models (LLMs) present an opportunity to rethink how urban causal analysis is conducted. This Perspective examines current urban causal research by analyzing taxonomies that categorize research topics, data sources, and methodological approaches to identify structural gaps. We then introduce an LLM-driven conceptual framework, AutoUrbanCI, composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. We propose evaluation criteria for rigor and transparency and reflect on implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces AI-augmented workflows not as replacements for human expertise but as tools to broaden participation, improve reproducibility, and unlock more inclusive forms of urban causal reasoning.

Title: Mathematical Capabilities of Large Language Models in Finnish Matriculation Examination

Authors: Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2504.12347
Pdf URL: https://arxiv.org/pdf/2504.12347
Copy Paste: [[2504.12347]] Mathematical Capabilities of Large Language Models in Finnish Matriculation Examination(https://arxiv.org/abs/2504.12347)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential to also support educational assessments at scale.

Title: A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports

Authors: Jing Wang, Jeremy C Weiss
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12350
Pdf URL: https://arxiv.org/pdf/2504.12350
Copy Paste: [[2504.12350]] A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports(https://arxiv.org/abs/2504.12350)
Keywords: language model, llm
Abstract: Timing of clinical events is central to characterization of patient trajectories, enabling analyses such as process tracing, forecasting, and causal reasoning. However, structured electronic health records capture few data elements critical to these tasks, while clinical reports lack temporal localization of events in structured form. We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. We contrast manual and large language model (LLM) annotations (n=320 and n=390 respectively) of ten randomly-sampled PubMed open-access (PMOA) case reports (N=152,974) and assess inter-LLM agreement (n=3,103; N=93). We find that the LLM models have moderate event recall (O1-preview: 0.80) but high temporal concordance among identified events (O1-preview: 0.95). By establishing the task, annotation, and assessment systems, and by demonstrating high concordance, this work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.

Title: Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media

Authors: Muhammad Ahmad, Muhammad Waqas, Ildar Batyrshin, Grigori Sidorov
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2504.12355
Pdf URL: https://arxiv.org/pdf/2504.12355
Copy Paste: [[2504.12355]] Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media(https://arxiv.org/abs/2504.12355)
Keywords: language model, llm
Abstract: Drug overdose remains a critical global health issue, often driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods face limitations, whereas social media offers real-time insights into self-reported substance use and overdose symptoms. This study proposes an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and associated overdose symptoms. Using a hybrid annotation strategy with LLMs and human annotators, we applied traditional ML models, neural networks, and advanced transformer-based models. Our framework achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. These findings highlight the potential of AI for supporting public health surveillance and personalized intervention strategies.

Title: Replicating ReLM Results: Validating Large Language Models with ReLM

Authors: Reece Adamson, Erin Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12357
Pdf URL: https://arxiv.org/pdf/2504.12357
Copy Paste: [[2504.12357]] Replicating ReLM Results: Validating Large Language Models with ReLM(https://arxiv.org/abs/2504.12357)
Keywords: language model, llm
Abstract: Validating Large Language Models with ReLM explores the application of formal languages to evaluate and control Large Language Models (LLMs) for memorization, bias, and zero-shot performance. Current approaches for evaluating these types of behavior are often slow, imprecise, costly, or introduce biases of their own, but are necessary due to the importance of this behavior when productionizing LLMs. This project reproduces key results from the original ReLM paper and expounds on the approach and applications with an emphasis on the relevance to the field of systems for machine learning.

Title: Position: The Most Expensive Part of an LLM should be its Training Data

Authors: Nikhil Kandpal, Colin Raffel
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12427
Pdf URL: https://arxiv.org/pdf/2504.12427
Copy Paste: [[2504.12427]] Position: The Most Expensive Part of an LLM should be its Training Data(https://arxiv.org/abs/2504.12427)
Keywords: language model, llm
Abstract: Training a state-of-the-art Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data. Every LLM is built on an unfathomable amount of human effort: trillions of carefully written words sourced from books, academic papers, codebases, social media, and more. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position, we study 64 LLMs released between 2016 and 2024, estimating what it would cost to pay people to produce their training datasets from scratch. Even under highly conservative estimates of wage rates, the costs of these models' training datasets are 10-1000 times larger than the costs to train the models themselves, representing a significant financial liability for LLM providers. In the face of the massive gap between the value of training data and the lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.

Title: On Linear Representations and Pretraining Data Frequency in Language Models

Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12459
Pdf URL: https://arxiv.org/pdf/2504.12459
Copy Paste: [[2504.12459]] On Linear Representations and Pretraining Data Frequency in Language Models(https://arxiv.org/abs/2504.12459)
Keywords: language model, gpt
Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded 'linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.

Title: SLURG: Investigating the Feasibility of Generating Synthetic Online Fallacious Discourse

Authors: Cal Blanco, Gavin Dsouza, Hugo Lin, Chelsey Rush
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12466
Pdf URL: https://arxiv.org/pdf/2504.12466
Copy Paste: [[2504.12466]] SLURG: Investigating the Feasibility of Generating Synthetic Online Fallacious Discourse(https://arxiv.org/abs/2504.12466)
Keywords: language model, llm, prompt
Abstract: In our paper we explore the definition and extrapolation of fallacies as they pertain to the automatic detection of manipulation on social media. In particular, we explore how these logical fallacies might appear in the real world, i.e., internet forums. We discovered a prevalence of misinformation / misguided intention in discussion boards specifically centered around the Ukrainian Russian Conflict which serves to narrow the domain of our task. Although automatic fallacy detection has gained attention recently, most datasets use unregulated fallacy taxonomies or are limited to formal linguistic domains like political debates or news reports. Online discourse, however, often features non-standardized and diverse language not captured in these domains. We present Shady Linguistic Utterance Replication-Generation (SLURG) to address these limitations, exploring the feasibility of generating synthetic fallacious forum-style comments using large language models (LLMs), specifically DeepHermes-3-Mistral-24B. Our findings indicate that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts enhance LLMs' ability to mimic the vocabulary diversity of online forums.

Title: Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

Authors: Azadeh Beiranvand, Seyed Mehdi Vahidipour
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12474
Pdf URL: https://arxiv.org/pdf/2504.12474
Copy Paste: [[2504.12474]] Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex(https://arxiv.org/abs/2504.12474)
Keywords: language model, llm, prompt
Abstract: Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.

Title: Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

Authors: Hansi Zeng, Kai Hui, Honglei Zhuang, Zhen Qin, Zhenrui Yue, Hamed Zamani, Dana Alon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12491
Pdf URL: https://arxiv.org/pdf/2504.12491
Copy Paste: [[2504.12491]] Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?(https://arxiv.org/abs/2504.12491)
Keywords: llm
Abstract: While metrics available during pre-training, such as perplexity, correlate well with model performance at scaling-laws studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study and demonstrate that the conventional perplexity is a misleading indicator. As such, we introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
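The pairwise framing in the abstract above (given two pre-trained checkpoints, predict which one performs better after SFT) suggests a simple way to score any candidate indicator. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name, the checkpoint names, and the example numbers are all hypothetical, and higher proxy values are assumed to predict better fine-tuned performance (a lower-is-better indicator such as perplexity would be negated first).

```python
from itertools import combinations


def pairwise_accuracy(proxy, downstream):
    """Fraction of checkpoint pairs whose proxy ordering matches the post-SFT ordering.

    `proxy` and `downstream` map checkpoint name -> score, higher is better.
    Pairs tied on the downstream score are skipped.
    """
    pairs = [
        (a, b) for a, b in combinations(proxy, 2) if downstream[a] != downstream[b]
    ]
    if not pairs:
        raise ValueError("no comparable checkpoint pairs")
    correct = sum(
        (proxy[a] > proxy[b]) == (downstream[a] > downstream[b]) for a, b in pairs
    )
    return correct / len(pairs)


# Hypothetical scores for three checkpoints.
proxy_scores = {"ckpt_x": 1.0, "ckpt_y": 2.0, "ckpt_z": 3.0}
sft_scores = {"ckpt_x": 0.10, "ckpt_y": 0.30, "ckpt_z": 0.20}
print(pairwise_accuracy(proxy_scores, sft_scores))
```

The pairwise prediction error rate is one minus this value, so 0.5 corresponds to chance on a balanced set of pairs.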

Title: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Authors: Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, Amelia Glaese
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12516
Pdf URL: https://arxiv.org/pdf/2504.12516
Copy Paste: [[2504.12516]] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents(https://arxiv.org/abs/2504.12516)
Keywords: agent
Abstract: We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at this https URL.

Title: Evaluating the Diversity and Quality of LLM Generated Content

Authors: Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12522
Pdf URL: https://arxiv.org/pdf/2504.12522
Copy Paste: [[2504.12522]] Evaluating the Diversity and Quality of LLM Generated Content(https://arxiv.org/abs/2504.12522)
Keywords: language model, llm
Abstract: Recent work suggests that preference-tuning techniques--including Reinforcement Learning from Human Preferences (RLHF) methods like PPO and GRPO, as well as alternatives like DPO--reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds--which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models--especially those trained via RL--exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity--revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.

Title: Memorization vs. Reasoning: Updating LLMs with New Knowledge

Authors: Aochong Oliver Li, Tanya Goyal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12523
Pdf URL: https://arxiv.org/pdf/2504.12523
Copy Paste: [[2504.12523]] Memorization vs. Reasoning: Updating LLMs with New Knowledge(https://arxiv.org/abs/2504.12523)
Keywords: language model, llm
Abstract: Large language models (LLMs) encode vast amounts of pre-trained knowledge in their parameters, but updating them as real-world information evolves remains a challenge. Existing methodologies and benchmarks primarily target entity substitutions, failing to capture the full breadth of complex real-world dynamics. In this paper, we introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates reflected in an evidence corpus. KUP's evaluation framework includes direct and indirect probes to both test memorization of updated facts and reasoning over them, for any update learning method. Next, we present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated "memory" tokens during training. Our strategy encourages LLMs to surface and reason over newly memorized knowledge at inference. Our results on two strong LLMs show that (1) the KUP benchmark is highly challenging, with the best CPT models achieving <2% in the indirect probing setting (reasoning) and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines, improving direct probing (memorization) results by up to 25.4%.

Title: Memorization: A Close Look at Books

Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12549
Pdf URL: https://arxiv.org/pdf/2504.12549
Copy Paste: [[2504.12549]] Memorization: A Close Look at Books(https://arxiv.org/abs/2504.12549)
Keywords: llm, prompt
Abstract: To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.

Title: ELAB: Extensive LLM Alignment Benchmark in Persian Language

Authors: Zahra Pourbahman, Fatemeh Rajabi, Mohammadhossein Sadeghi, Omid Ghahroodi, Somaye Bakhshaei, Arash Amini, Reza Kazemi, Mahdieh Soleymani Baghshah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12553
Pdf URL: https://arxiv.org/pdf/2504.12553
Copy Paste: [[2504.12553]] ELAB: Extensive LLM Alignment Benchmark in Persian Language(https://arxiv.org/abs/2504.12553)
Keywords: language model, llm
Abstract: This paper presents a comprehensive evaluation framework for aligning Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. It addresses the gaps in existing LLM evaluation frameworks by adapting them to Persian linguistic and cultural contexts. This benchmark creates three types of Persian-language benchmarks: (i) translated data, (ii) new data generated synthetically, and (iii) new naturally collected data. We translate Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust into Persian. Furthermore, we create ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa as new datasets to address harmful and prohibited content in indigenous culture. Moreover, we collect an extensive dataset, GuardBench-fa, to consider Persian cultural norms. By combining these datasets, our work establishes a unified framework for evaluating Persian LLMs, offering a new approach to culturally grounded alignment evaluation. A systematic evaluation of Persian LLMs is performed across the three alignment aspects: safety (avoiding harmful content), fairness (mitigating biases), and social norms (adhering to culturally accepted behaviors). We present a publicly available leaderboard that benchmarks Persian LLMs with respect to safety, fairness, and social norms at: this https URL.

Title: CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation

Authors: Elahe Khatibi, Ziyu Wang, Amir M. Rahmani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12560
Pdf URL: https://arxiv.org/pdf/2504.12560
Copy Paste: [[2504.12560]] CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation(https://arxiv.org/abs/2504.12560)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has significantly enhanced large language models (LLMs) in knowledge-intensive tasks by incorporating external knowledge retrieval. However, existing RAG frameworks primarily rely on semantic similarity and correlation-driven retrieval, limiting their ability to distinguish true causal relationships from spurious associations. This results in responses that may be factually grounded but fail to establish cause-and-effect mechanisms, leading to incomplete or misleading insights. To address this issue, we introduce Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG), a framework designed to improve causal consistency, factual accuracy, and explainability in generative reasoning. CDF-RAG iteratively refines queries, retrieves structured causal graphs, and enables multi-hop causal reasoning across interconnected knowledge sources. Additionally, it validates responses against causal pathways, ensuring logically coherent and factually grounded outputs. We evaluate CDF-RAG on four diverse datasets, demonstrating its ability to improve response accuracy and causal correctness over existing RAG-based methods. Our code is publicly available at this https URL elakhatibi/CDF-RAG.

Title: MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Authors: Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12563
Pdf URL: https://arxiv.org/pdf/2504.12563
Copy Paste: [[2504.12563]] MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation(https://arxiv.org/abs/2504.12563)
Keywords: language model, llm, prompt, agent
Abstract: Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains (Finance and Biomedicine) without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.
摘要：最近的较小语言模型，例如PHI-3.5和PHI-4，依赖于使用较大语言模型生成的合成数据。关于利用合成数据的其他用例，例如将LLMS适应特定域的问题仍然存在。合成数据的关键局限性是低多样性，这对其下游适用于改善其他模型的适用性产生负面影响。为了解决这个问题，我们提出了MetAsynth，这是一种生成合成数据的方法，可以通过元数据来增强多样性，其中语言模型策划了多个“专家” LLM LLM代理以协作生成数据。我们仅使用Metasynth产生的2500万个令牌的综合数据，我们成功地适应了训练有素的LLM（Mistral-7b-V0.3），以适应两个专用域，并在一般任务中损害所得模型的能力。此外，我们使用七个自动化指标评估了合成数据的多样性，并发现它接近LLM培训前语料库的多样性。持续训练的Mistral-7b-v0.3具有Metasynth的表现尤其优于LLM，表明金融的提高高达4.08％，生物医学的提高了13.75％。同一模型在使用模板提示的数据进行培训时，表明了降级性能，即使模板包含前几代和实际数据的内在范例有所不同。我们的发现表明，使用MetAsynth时，有几百万个不同的合成数据代币就足以适应有效的域。
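
The abstract above does not name its seven automated diversity metrics. Purely as an illustration, the sketch below computes distinct-n (unique n-grams divided by total n-grams), one widely used lexical diversity measure. Whether MetaSynth uses this metric is an assumption, and the function name `distinct_n` and the whitespace tokenization are hypothetical choices.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over a list of texts.

    Illustrative only: whitespace tokenization, lowercased.
    Returns 0.0 when no n-gram can be formed.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical toy corpora: a templated one repeats itself, a varied one does not.
templated = ["the market rose today", "the market rose today"]
varied = ["the market rose today", "bond yields fell sharply overnight"]
print(distinct_n(templated), distinct_n(varied))
```

A low score indicates heavy n-gram repetition, the kind of low diversity the abstract attributes to template-prompted data.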

Title: Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12585
Pdf URL: https://arxiv.org/pdf/2504.12585
Copy Paste: [[2504.12585]] Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models(https://arxiv.org/abs/2504.12585)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.
摘要：大型语言模型（LLM）有时无法对确定性任务（例如计数或形成缩写词）做出适当的响应，因为他们对代币序列学到的隐含先前分布会影响他们的响应。在这项工作中，我们表明，至少在某些情况下，LLMS实际上计算了正确执行这些任务所需的信息，并且我们确定了一些干预措施，这些干预措施可以允许他们访问此信息以提高其性能。首先，我们表明，简单地提示语言模型不依赖其先验知识会导致先前主导的任务的巨大改进。然后，我们使用机械性解释性技术来将LLM中的先验定位并操纵先验影响其响应的程度。具体而言，我们表明，可以识别与响应的先前概率相关的基础神经网络的层，并且这些层对这些层的轻巧登录具有基本的提示，可以在先前主导的任务上实现高性能。这些结果表明，产生正确响应所需的信息包含在模型形成的问题的表示中。此外，我们表明这种填充对于先前主导的任务更有效，并且固定后的错误不再与先验相关。我们的结果表明，有可能定义有效的方法来操纵LLMS在解决问题方面依赖其先验的程度，从而在LLMS幻觉中出于与代币序列的先前概率相关的原因而在LLMS幻觉中提高其性能。

Title: GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

Authors: Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12597
Pdf URL: https://arxiv.org/pdf/2504.12597
Copy Paste: [[2504.12597]] GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning(https://arxiv.org/abs/2504.12597)
Keywords: language model, llm
Abstract: Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of 65.3. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.
摘要：几何问题解决（GPS）是一项具有挑战性的任务，需要视觉理解和象征性推理，有效地衡量了多模式大语言模型（MLLM）的推理能力。人类通过在视觉环境中准确识别和自适应应用几何原理，在此任务中表现出强大的推理能力。但是，现有的基准无法共同评估MLLM中类似人类的几何推理机制的两个维度，在评估其处理GPS的能力方面仍然是一个危险的差距。为此，我们介绍了GeoSense，这是第一个全面的双语基准测试，旨在系统地评估MLLM的几何原理的几何推理能力。 GeoSense具有跨越平面和固体几何形状的几何原理的五级分层框架，一个复杂的注释数据集的1,789个问题以及创新的评估策略。通过各种开源和封闭源MLLM的Geosense进行广泛的实验，我们观察到Gemini-2.0-Pro-Flash的表现最佳，总分达到65.3美元。我们的深入分析表明，几何原理的识别和应用仍然是领先MLLM的瓶颈，共同阻碍了其推理能力。这些发现强调了Geosense在MLLM的几何推理能力中指导未来进步的潜力，为在人工智能中更加坚固和类似人类的推理铺平了道路。

Title: Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs

Authors: Younghun Lee, Dan Goldwasser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12633
Pdf URL: https://arxiv.org/pdf/2504.12633
Copy Paste: [[2504.12633]] Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs(https://arxiv.org/abs/2504.12633)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have not only solved complex reasoning problems but also exhibit remarkable performance in tasks that require subjective decision making. Existing studies suggest that LLM generations can be subjectively grounded to some extent, yet whether LLMs can account for individual-level subjectivity has not been sufficiently studied. In this paper, we characterize the subjectivity of individuals on social media and infer their moral judgments using LLMs. We propose a framework, SOLAR (Subjective Ground with Value Abstraction), that observes value conflicts and trade-offs in user-generated texts to better represent the subjective ground of individuals. Empirical results show that our framework improves overall inference results as well as performance on controversial situations. Additionally, we qualitatively show that SOLAR provides explanations about individuals' value preferences, which can further account for their judgments.
摘要：大型语言模型（LLMS）不仅解决了复杂的推理问题，而且在需要主观决策的任务中表现出色。现有的研究表明，LLM世代可以在某种程度上主观基础，但探索LLM是否可以考虑个人级别的主观性，尚未得到充分研究。在本文中，我们表征了个人在社交媒体上的主观性，并使用LLMS推断他们的道德判断。我们提出了一个框架，太阳能（带有价值抽象的主观基础），该框架观察到用户生成的文本中的价值冲突和权衡，以更好地代表个人的主观基础。经验结果表明，我们的框架改善了有争议情况的总体推理结果以及表现。此外，我们定性地表明，太阳能提供了有关个人价值偏好的解释，这可以进一步考虑其判断。

Title: Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12637
Pdf URL: https://arxiv.org/pdf/2504.12637
Copy Paste: [[2504.12637]] Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation(https://arxiv.org/abs/2504.12637)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.
摘要：大型语言模型（LLMS）与长篇小说推理斗争，不仅是由于计算复杂性的二次缩放，而且还因为稀缺性和注释长篇小说数据的稀缺性和费用。几乎没有任何有系统地烧掉长篇小说数据的开源工作，也没有任何公开可用的指令调谐数据集，上下文超过100k令牌。为了弥合这一差距，我们引入了一种新颖的培训后合成数据生成策略，旨在有效地扩展LLM的上下文窗口，同时保留其一般任务绩效。我们的方法可扩展到任意长的上下文长度，这不受可用现实世界数据的长度的影响，这有效地解决了原始长篇小说数据的稀缺性。通过分步旋转位置嵌入（绳索）缩放训练策略，我们证明，我们的模型具有高达1M令牌的上下文长度，在标尺基准和InfiniteBench上表现良好，并在一般语言任务上保持了强劲的性能。
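
The abstract above mentions a step-by-step RoPE scaling strategy but does not spell out the schedule. As a minimal sketch under stated assumptions, the code below shows plain rotary position embedding with linear position scaling (often called position interpolation), one known way to stretch a context window. The `scale` parameter and the function names are hypothetical and are not taken from the paper.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a given position.

    scale > 1 divides the position index, so a longer sequence is mapped
    back into the position range seen during training (assumed variant).
    """
    return [(position / scale) / (base ** (2 * i / dim)) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by their position-dependent angle."""
    angles = rope_angles(position, len(vec), base, scale)
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


# With scale=4, position 8000 receives the same angles as position 2000 unscaled.
print(rope_angles(8000, 4, scale=4.0) == rope_angles(2000, 4))
```

Because each pair is only rotated, the vector norm is unchanged; scaling alters which angle a position maps to, not the magnitude of the embedding.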

Title: Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment

Authors: Xiaotian Zhang, Ruizhe Chen, Yang Feng, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12663
Pdf URL: https://arxiv.org/pdf/2504.12663
Copy Paste: [[2504.12663]] Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment(https://arxiv.org/abs/2504.12663)
Keywords: language model
Abstract: Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates the predicted tokens to decide whether they are accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment.
摘要：将语言模型与人类偏好保持一致，提出了重大挑战，尤其是在实现个性化的情况下而不会产生过多的计算成本。现有方法依赖于奖励信号和其他带注释的数据，从而限制了它们对各种人类价值观的可扩展性和适应性。为了应对这些挑战，我们介绍了角色法官，这是一种新颖的歧视性范式，可以使无培训的个性化对齐方式具有看不见的偏好。人物法官没有通过外部奖励反馈来优化策略参数，而是利用模型的内在偏好判断能力。具体而言，草稿模型生成以给定偏好为条件的候选令牌，而法官模型则体现了另一种偏好，跨验证了预测的令牌是否被接受。实验结果表明，使用模型的固有偏好评估机制，角色 - 法官为个性化对齐提供了可扩展且有效的效率解决方案，为更自适应的定制对准铺平了道路。
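
The draft-and-judge step described above can be pictured with a toy example. The sketch below is not the paper's acceptance rule, which the abstract does not specify. It assumes a simple probability threshold and a fallback to the judge's top token, and the names `judge_decode` and `threshold` are hypothetical.

```python
def judge_decode(draft_probs, judge_probs, threshold=0.1):
    """Pick the next token from two next-token distributions (dicts token -> prob).

    The draft model proposes its most likely token. The judge accepts it if
    it assigns the token at least `threshold` probability; otherwise the
    judge's own most likely token is used (assumed fallback).
    Returns (token, accepted_flag).
    """
    candidate = max(draft_probs, key=draft_probs.get)
    if judge_probs.get(candidate, 0.0) >= threshold:
        return candidate, True
    return max(judge_probs, key=judge_probs.get), False


# Hypothetical distributions: the draft prefers brevity, the judge prefers detail.
draft = {"concise": 0.6, "detailed": 0.4}
judge = {"concise": 0.05, "detailed": 0.7, "formal": 0.25}
print(judge_decode(draft, judge))
```

No parameters are updated in this loop, which matches the training-free framing of the abstract; everything else about the rule is an illustrative assumption.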

Title: ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

Authors: Singon Kim, Gunho Jung, Seong-Whan Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12673
Pdf URL: https://arxiv.org/pdf/2504.12673
Copy Paste: [[2504.12673]] ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models(https://arxiv.org/abs/2504.12673)
Keywords: language model, long context, retrieval-augmented generation
Abstract: Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
摘要：抽象的压缩利用较小的Langauge模型来凝结与查询相关的环境，从而降低了检索增强发电（RAG）的计算成本。但是，检索的文件通常包含与回答查询无关的信息，或者由于事实不正确的内容而引起的误导性，尽管相关性得分很高。这种行为表明，抽象压缩机更有可能忽略正确答案所必需的重要信息，尤其是在发生注意力分散的长篇小说中。为了解决这个问题，我们以更细粒度的方式对检索到的文档进行了分类，并提出了针对噪声（ACORN）的抽象压缩，这引入了两个新颖的训练步骤。首先，我们使用训练数据集上的离线数据扩展来增强压缩机的鲁棒性，以针对两种不同类型的检索噪声增强。其次，由于基于语言模型的压缩机无法从多个检索的文档中充分利用信息，因此我们执行填充以生成围绕直接支持正确答案的关键信息的摘要。我们的实验表明，用橡子作为压缩机训练的T5大型可以改善EM和F1分数，同时保留答案字符串，这可以作为直接证据。 Acorn在数据集上具有许多精确降低的文档，使其在现实情况下非常有用。
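
For readers unfamiliar with the EM and F1 scores reported above, here is a minimal sketch of how exact match and token-level F1 are commonly computed for question answering. The normalization (lowercasing, stripping punctuation and English articles) follows a common convention and is an assumption; the paper's exact evaluation script is not described in the abstract.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("Paris, France", "Paris"))
```

EM rewards only a full match, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.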

Title: GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs

Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12681
Pdf URL: https://arxiv.org/pdf/2504.12681
Copy Paste: [[2504.12681]] GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs(https://arxiv.org/abs/2504.12681)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) trained on extensive datasets often learn sensitive information, which raises significant social and legal concerns under principles such as the "Right to be forgotten." Retraining entire models from scratch to remove undesired information is both costly and impractical. Furthermore, existing single-domain unlearning methods fail to address multi-domain scenarios, where knowledge is interwoven across domains such as privacy and copyright, creating overlapping representations that lead to excessive knowledge removal or degraded performance. To tackle these issues, we propose GRAIL (GRadient-based AdaptIve unLearning), a novel multi-domain unlearning framework. GRAIL leverages gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge while preserving critical parameters for each domain. Experimental results on unlearning benchmarks show that GRAIL achieves unlearning success on par with existing approaches, while also demonstrating up to 17% stronger knowledge retention success compared to the previous state-of-the-art method. Our findings establish a new paradigm for effectively managing and regulating sensitive information in large-scale pre-trained language models.
摘要：在广泛的数据集中培训的大型语言模型（LLM）经常学习敏感信息，这在“被遗忘的权利”之类的原则下引起了重大的社会和法律问题。从头开始审查整个模型以删除不希望的信息既昂贵又不切实际。此外，现有的单域学习方法无法解决多域情景，在这种情况下，知识在隐私和版权等领域之间交织在一起，从而创建了重叠的表示，从而导致过度删除知识或降级性能。为了解决这些问题，我们提出了一种新型的多域遗产框架（基于梯度的自适应 - 基于梯度的自适应划分）。 Grail利用来自多个域的梯度信息可以精确区分未学习范围与保留范围，并应用自适应参数定位策略，以选择性地删除目标知识，同时保留每个域的关键参数。关于学习基准的实验结果表明，与现有方法相比，Grail在与现有方法的同等方面取得了成功，同时也证明了与先前的最新方法相比，知识保留成功率高17％。我们的发现建立了一个新的范式，用于在大规模的预训练语言模型中有效管理和调节敏感信息。

Title: Data-efficient LLM Fine-tuning for Code Generation

Authors: Weijie Lv, Xuan Xia, Sheng-Jun Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12687
Pdf URL: https://arxiv.org/pdf/2504.12687
Copy Paste: [[2504.12687]] Data-efficient LLM Fine-tuning for Code Generation(https://arxiv.org/abs/2504.12687)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order to improve the effectiveness and efficiency of training for code-based LLMs. By prioritizing data complexity and ensuring that the sampled subset aligns with the distribution of the original dataset, our sampling strategy effectively selects high-quality data. Additionally, we optimize the tokenization process through a "dynamic pack" technique, which minimizes padding tokens and reduces computational resource consumption. Experimental results show that when training on 40% of the OSS-Instruct dataset, the DeepSeek-Coder-Base-6.7B model achieves an average performance of 66.9%, surpassing the 66.1% performance with the full dataset. Moreover, training time is reduced from 47 minutes to 34 minutes, and the peak GPU memory decreases from 61.47 GB to 42.72 GB during a single epoch. Similar improvements are observed with the CodeLlama-Python-7B model on the Evol-Instruct dataset. By optimizing both data selection and tokenization, our approach not only improves model performance but also improves training efficiency.
摘要：大型语言模型（LLM）在代码生成任务中表现出了巨大的潜力。但是，开源和封闭源模型之间仍然存在性能差距。为了解决这一差距，现有方法通常会生成大量的合成数据进行微调，这通常会导致效率低下的培训。在这项工作中，我们提出了一个数据选择策略，以提高基于代码的LLM的培训的有效性和效率。通过确定数据复杂性并确保采样子集与原始数据集的分布保持一致，我们的采样策略有效地选择了高质量的数据。此外，我们通过“动态包”技术优化令牌化过程，该技术可最大程度地减少填充令牌并减少计算资源消耗。实验结果表明，当对40％OSS-Instruct数据集进行训练时，DeepSeek-Coder-Base-6.7b模型的平均性能达到66.9％，超过了完整数据集的66.1％的性能。此外，训练时间从47分钟减少到34分钟，在单个时期内，GPU的峰值记忆从61.47 GB降低到42.72 GB。 EVOL教学数据集上的Codellama-Python-7b模型也观察到了类似的改进。通过优化数据选择和令牌化，我们的方法不仅可以提高模型性能，还可以提高培训效率。
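
The abstract above says the "dynamic pack" technique minimizes padding tokens but gives no algorithmic detail. The sketch below illustrates the general idea with greedy first-fit-decreasing packing of tokenized samples into fixed-length rows. This is an assumed stand-in, not the authors' method, and all function names are hypothetical.

```python
def pad_count_naive(lengths, max_len):
    """Padding tokens needed if each sample is padded to max_len on its own."""
    return sum(max_len - n for n in lengths)


def pack_first_fit(lengths, max_len):
    """Greedily pack sample lengths into rows holding at most max_len tokens.

    Samples are visited longest first; each goes into the first row with
    enough remaining room, or opens a new row.
    """
    rows = []
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError("sample longer than max_len")
        for row in rows:
            if sum(row) + n <= max_len:
                row.append(n)
                break
        else:
            rows.append([n])
    return rows


def pad_count_packed(rows, max_len):
    """Padding tokens left over after packing."""
    return sum(max_len - sum(row) for row in rows)


# Hypothetical token counts for six training samples.
lengths = [700, 300, 512, 200, 100, 900]
rows = pack_first_fit(lengths, 1024)
print(rows, pad_count_naive(lengths, 1024), pad_count_packed(rows, 1024))
```

In this toy case packing halves the number of rows and cuts padding roughly tenfold. A real training pipeline would also need attention masks so that samples sharing a row do not attend to one another.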

Title: Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations

Authors: Yiyou Sun, Yu Gai, Lijie Chen, Abhilasha Ravichander, Yejin Choi, Dawn Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12691
Pdf URL: https://arxiv.org/pdf/2504.12691
Copy Paste: [[2504.12691]] Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations(https://arxiv.org/abs/2504.12691)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) frequently generate hallucinations (content that deviates from factual accuracy or the provided context), posing challenges for diagnosis due to the complex interplay of underlying causes. This paper introduces a subsequence association framework to systematically trace and understand hallucinations. Our key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analyses, we demonstrate that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model's training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.
摘要：大型语言模型（LLMS）经常产生幻觉 - 偏离事实准确性或由于基本原因的复杂相互作用而为诊断提供了上下文挑战。本文介绍了一个子序列关联框架，以系统地追踪和理解幻觉。我们的关键见解是，当主导的幻觉协会超过忠实的幻觉协会时，就会出现幻觉。通过理论和经验分析，我们证明了仅解码器的变压器有效地充当子序列嵌入模型，线性层编码输入输出相关。我们提出了一种通过分析随机输入上下文中的幻觉概率来识别因果子序列的追踪算法。实验表明，我们的方法在识别幻觉原因并与模型训练语料库的证据相处融洽时优于标准归因技术。这项工作提供了有关幻觉的统一观点，并为其追踪和分析提供了强大的框架。

Title: Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Authors: Yongrui Chen, Junhao He, Linbo Fu, Shenyu Zhang, Rihui Jin, Xinbang Dai, Jiaqi Li, Dehai Min, Nan Hu, Yuxin Zhang, Guilin Qi, Yi Huang, Tongtong Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12734
Pdf URL: https://arxiv.org/pdf/2504.12734
Copy Paste: [[2504.12734]] Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge(https://arxiv.org/abs/2504.12734)
Keywords: language model, llm, agent
Abstract: Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on either task-specific strategies or custom-defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or to align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named Pandora, which takes advantage of Python's Pandas API to construct a unified knowledge representation for alignment with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that Pandora outperforms existing unified frameworks and competes effectively with task-specific methods.
摘要：统一的结构化知识推理（USKR）旨在通过以统一的方式使用结构化来源（例如表，数据库和知识图）来回答自然语言问题（NLQ）。现有的USKR方法要么依赖于采用特定任务的策略或定义定义的表示，这些策略努力利用不同的SKR任务之间的知识转移或与LLM的先验相一致，从而限制了其绩效。本文提出了一个名为\ textsc {pandora}的新颖的USKR框架，该框架利用\ textsc {python}'s \ textsc {pandas} api构建统一的知识表示，以与LLM预先培训相结合。它采用LLM来为每个问题生成文本推理步骤和可执行的Python代码。演示是从涵盖各种SKR任务的培训示例的记忆中得出的，从而促进了知识转移。涉及三个SKR任务的四个基准测试的广泛实验表明，\ textsc {pandora}的表现优于现有的统一框架，并与特定于任务的方法有效竞争。

Title: Chinese-Vicuna: A Chinese Instruction-following Llama-based Model

Authors: Chenghao Fan, Zhenyi Lu, Jie Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12737
Pdf URL: https://arxiv.org/pdf/2504.12737
Copy Paste: [[2504.12737]] Chinese-Vicuna: A Chinese Instruction-following Llama-based Model(https://arxiv.org/abs/2504.12737)
Keywords: language model, llm
Abstract: Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta's LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healthcare and law. By integrating hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), the model achieves competitive performance in tasks such as translation, code generation, and domain-specific Q&A. The project provides a comprehensive toolkit for model conversion, CPU inference, and multi-turn dialogue interfaces, emphasizing accessibility for researchers and developers. Evaluations indicate competitive performance across medical tasks, multi-turn dialogue coherence, and real-time legal updates. Chinese-Vicuna's modular design, open-source ecosystem, and community-driven enhancements position it as a versatile foundation for Chinese LLM applications.
摘要：中国磁盘是一种开源的，资源有效的语言模型，旨在通过使用低级适应（LORA）微调Meta的Llama架构来弥合中文指导遵循的功能的差距。针对低资源环境，它可以对消费者GPU（例如7B型号的RTX-2080TI）进行具有成本效益的部署，并支持医疗保健和法律等领域的特定领域适应。通过集成混合数据集（Belle和Guanaco）和4位量化（Qlora），该模型可以在诸如翻译，代码生成和特定领域的Q \＆a之类的任务中实现竞争性能。该项目为模型转换，CPU推理和多转化对话接口提供了全面的工具包，强调了研究人员和开发人员的可访问性。评估表明跨医疗任务，多转化对话连贯性和实时法律更新的竞争性能。中国磁盘的模块化设计，开源生态系统和社区驱动的增强功能将其定位为中国LLM应用的多功能基础。
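
To make the LoRA idea mentioned above concrete, here is a minimal pure-Python sketch of the standard low-rank update, in which a frozen weight matrix W is adapted as W + (alpha / r) * B A with small trainable matrices B and A. The toy matrices and helper names are hypothetical, and the 4-bit quantization (QLoRA) part of the abstract is not modeled.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    Shapes: W is d_out x d_in, B is d_out x r, A is r x d_in.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


def trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank factors, versus d_out * d_in for full tuning."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy values)
A = [[0.5, -1.0]]              # r x d_in with r = 1
B = [[1.0], [2.0]]             # d_out x r
print(lora_merge(W, A, B, alpha=2, r=1))
print(trainable_params(4096, 4096, 8), 4096 * 4096)
```

For a 4096 x 4096 layer at rank 8, the factors hold 65,536 parameters against about 16.8 million for the full matrix, which is the source of the resource savings the abstract highlights.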

Title: Out of Sight Out of Mind, Out of Sight Out of Mind: Measuring Bias in Language Models Against Overlooked Marginalized Groups in Regional Contexts

Authors: Fatma Elsafoury, David Hartmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12767
Pdf URL: https://arxiv.org/pdf/2504.12767
Copy Paste: [[2504.12767]] Out of Sight Out of Mind, Out of Sight Out of Mind: Measuring Bias in Language Models Against Overlooked Marginalized Groups in Regional Contexts(https://arxiv.org/abs/2504.12767)
Keywords: language model
Abstract: We know that language models (LMs) form biases and stereotypes of minorities, leading to unfair treatment of members of these groups, thanks to research mainly in the US and the broader English-speaking world. As the negative behavior of these models has severe consequences for society and individuals, industry and academia are actively developing methods to reduce the bias in LMs. However, there are many under-represented groups and languages that have been overlooked so far. This includes marginalized groups that are specific to individual countries and regions in the English-speaking and Western world, but crucially also almost all marginalized groups in the rest of the world. The UN estimates that between 600 million and 1.2 billion people worldwide are members of marginalized groups and in need of special protection. If we want to develop inclusive LMs that work for everyone, we have to broaden our understanding to include overlooked marginalized groups and low-resource languages and dialects. In this work, we contribute to this effort with the first study investigating offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US. Additionally, we investigate the impact of low-resource languages and dialects on the study of bias in LMs, demonstrating the limitations of current bias metrics, as we measure significantly higher bias when using the Egyptian Arabic dialect versus Modern Standard Arabic. Our results show that LMs indeed exhibit higher bias against many marginalized groups in comparison to dominant groups. However, this is not the case for Arabic LMs, where the bias is high against both marginalized and dominant groups in relation to religion and ethnicity. Our results also show higher intersectional bias against Non-binary, LGBTQIA+ and Black women.
摘要：我们知道，由于主要在美国和更广泛的英语世界的研究，语言模型（LMS）形成了少数群体的偏见和刻板印象，从而导致了这些群体成员的不公平治疗。由于这些模型的负面行为对社会和个人产生了严重的后果，因此行业和学术界正在积极开发减少LMS偏见的方法。但是，到目前为止，有许多代表性不足的群体和语言被忽略了。这包括针对英语和西方世界中各个国家和地区的边缘化群体，但至关重要的是世界其他地区几乎所有边缘化的群体。联合国估计，全球6亿至12亿人口是边缘化团体的成员，需要特别保护。如果我们想开发适合所有人有用的包容性LM，我们必须扩大我们的理解，以包括被忽视的边缘化群体和低资源语言和方言。在这项工作中，我们为这项努力做出了贡献，首次研究调查了23 LMS的进攻刻板印象偏见，该研究来自埃及，其余21个阿拉伯国家，德国，英国和美国的270个边缘化群体。此外，我们研究了低资源语言和方言对LMS偏见研究的影响，证明了当前偏见指标的局限性，因为我们在使用埃及阿拉伯方言与现代标准阿拉伯语时测量了明显更高的偏见。我们的结果表明，与主导群体相比，LMS确实显示出对许多边缘化群体的更高偏见。但是，阿拉伯语LMS并非如此，在这种情况下，与宗教和种族有关的边缘化和主导群体的偏见很高。我们的结果还表明了对非二元，LGBTQIA+和黑人妇女的较高偏见。

Title: Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration

Authors: Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12773
Pdf URL: https://arxiv.org/pdf/2504.12773
Copy Paste: [[2504.12773]] Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration(https://arxiv.org/abs/2504.12773)
Keywords: language model, llm, hallucination
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. The code is publicly available (link on the arXiv abstract page).
摘要：多模式大语言模型（MLLM）的最新进展在一般领域取得了显着的进步，并在多模式数学推理中表现出了希望。但是，由于缺乏准确的逐步解决方案数据和推理过程中严重的幻觉，将MLLM应用于几何问题解决（GPS）仍然具有挑战性。在本文中，我们提出了一个可以自动生成几何图的逐步推理路径的管道。通过利用精确的符号推理，\ textbf {geogen}产生大规模的高质量提问对。为了进一步提高MLLM的逻辑推理能力，我们使用Geogen生成的合成数据训练\ TextBf {地质模型（LLM）。作为自然语言和符号系统之间的桥梁，地质学使符号工具可以帮助验证MLLM输出，从而使推理过程更加严格和减轻幻觉。实验结果表明，我们的方法一致地提高了MLLM的性能，从而在几何推理任务的基准上取得了显着的结果。这种改进源于我们整合了LLM和符号系统的优势，这为GPS任务提供了一种更可靠和可解释的方法。代码可在此HTTPS URL上找到。

Title: Assesing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation

Authors: Takaya Arita, Wenxian Zheng, Reiji Suzuki, Fuminori Akiba
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2504.12805
Pdf URL: https://arxiv.org/pdf/2504.12805
Copy Paste: [[2504.12805]] Assesing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation(https://arxiv.org/abs/2504.12805)
Keywords: language model, llm, prompt
Abstract: This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential. While our findings do not directly contradict the so-called Generative AI Paradox--the idea that LLMs can produce expert-like output without genuine understanding--they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.
摘要：这项研究探讨了大型语言模型（LLM）在与艺术有关的两个领域中的表现：在与艺术相关的情况下，对艺术品的批评以及关于心理状态（心理理论或TOM）的推理。对于批评生成部分，我们建立了一个系统，将Noel Carroll的评估框架与广泛的艺术批评理论相结合。提示该模型首先写出全长的批评，然后使用逐步提示的过程更短，更连贯的版本。然后将这些AI生成的批评与人类专家在图灵测试式评估中撰写的批评进行了比较。在许多情况下，人类受试者很难判断哪个是哪个，结果表明，LLM可以产生不仅在风格上合理的批评，而且只要仔细地引导它们就可以富裕，而且也有丰富的解释。在第二部分中，我们根据涉及解释，情感和道德张力的情况引入了新的简单TOM任务，这些任务可能会出现在艺术的背景下。这些超出了标准的假质测试，并允许更复杂的社会嵌入推理形式。我们测试了41个LLMS，发现它们的性能在任务和模型中各不相同。特别是，涉及情感或歧义情况的任务往往揭示了更明显的差异。综上所述，这些结果有助于阐明LLM应对复杂的解释性挑战的反应，从而揭示了它们的认知局限性和潜力。尽管我们的发现并没有直接与所谓的生成AI悖论相矛盾 - LLM可以在不真正理解的情况下产生类似专家的输出的想法 - 他们表明，取决于通过精心设计的提示（例如，这些模型）可能开始表现出比我们可能比我们想象更紧密地理解的行为。

Title: Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks

Authors: Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12845
Pdf URL: https://arxiv.org/pdf/2504.12845
Copy Paste: [[2504.12845]] Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks(https://arxiv.org/abs/2504.12845)
Keywords: language model, llm, retrieval augmented generation
Abstract: Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model's ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model's capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage, short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including tasks that assess multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, MLRBench is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with an open-weight large language model (LLM) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval Augmented Generation helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in improved evaluation and training of multilingual LLMs.
摘要：现有的多语言长篇小写基准通常基于流行的 - 海豹测试，主要评估模型的能力，可以找到埋在无关的文本中的特定信息。但是，这样一种以检索为中心的方法是近视和天生有限的，因为仅成功的召回并不能表明模型在扩展上下文上推理的能力。此外，这些基准易受数据泄漏，短路和风险的影响，使评估是可识别的。为了解决这些局限性，我们引入了MLRBENCH，这是一种用于多语言长篇下说推理的新合成基准。与现有的基准不同，MLRBENCH通过包括评估多跳推断，聚集和认知推理的任务超出了表面水平的检索。跨越七种语言，MLRBENCH设计为平行，可抵抗泄漏，可扩展到任意上下文长度。我们对开放式大语言模型（LLM）进行的广泛实验揭示了高资源和低资源语言之间的明显差距，尤其是对于要求该模型汇总多个事实或预测缺乏信息的任务。我们还发现，在多语言设置中，LLMS有效地利用了其声称的上下文长度的30％。尽管现成的检索增强发电有助于在一定程度上减轻这种情况，但它并不能解决长篇小说问题。我们开源MLRBENCH，以改善对多语言LLM的评估和培训的未来研究。

Title: ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos

Authors: Patrick Giedemann, Pius von Däniken, Jan Deriu, Alvaro Rodrigo, Anselmo Peñas, Mark Cieliebak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12882
Pdf URL: https://arxiv.org/pdf/2504.12882
Copy Paste: [[2504.12882]] ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos(https://arxiv.org/abs/2504.12882)
Keywords: language model
Abstract: The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
摘要：视频内容作为通信和错误信息的媒介的日益增长的影响强调了迫切需要有效的工具来分析多语言和多主题环境中的主张。现有的错误信息检测的努力主要集中在书面文本上，在解决视频成绩单中口语文本的复杂性方面留下了很大的差距。我们介绍了Vicaim，这是一个围绕三种语言（英语，德语，西班牙语）和六个主题的1,798个注释视频笔录的数据集。成绩单中的每个句子都标有三个与索赔相关的类别：值得核对事实，事实核对或意见。我们开发了一种自定义注释工具来促进高度复杂的注释过程。使用最先进的多语言语言模型的实验表明，交叉验证的性能很强（宏F1高达0.896），但揭示了对看不见的主题的概括挑战，尤其是对于不同的领域。我们的发现突出了视频成绩单中索赔检测的复杂性。 Vicaim为在基于视频的沟通中推进错误信息检测提供了强大的基础，并解决了多模式分析中的关键差距。
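
The abstract above reports a macro F1 score for sentences that can carry several claim labels at once. As a minimal sketch of how such a score is commonly computed (per-label F1, then an unweighted mean), assuming each sentence is represented as a set of label strings: the label names `fcw`, `fnc`, `opn` and the toy data are hypothetical and are not taken from ViClaim or its evaluation code.

```python
from typing import List, Sequence, Set


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]], labels: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 scores for multilabel predictions."""
    scores = []
    for label in labels:
        tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
        fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
        fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical labels and toy annotations, for illustration only.
labels = ["fcw", "fnc", "opn"]
y_true = [{"fcw"}, {"opn"}, {"fcw", "opn"}, {"fnc"}]
y_pred = [{"fcw"}, {"opn"}, {"opn"}, {"fcw"}]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 3))  # per-label F1 is 0.5, 0.0, 1.0, so the mean is 0.5
```

Because every label contributes equally to the mean, a rare label that the model misses entirely pulls the macro score down as much as a frequent one would.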

Title: Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication

Authors: Vicent Briva-Iglesias
Subjects: cs.CL, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2504.12891
Pdf URL: https://arxiv.org/pdf/2504.12891
Copy Paste: [[2504.12891]] Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication(https://arxiv.org/abs/2504.12891)
Keywords: agent
Abstract: The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with superior translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT and their integration into professional translation workflows, and shares a demo of the system analyzed in the paper.
摘要：人工智能（AI）的快速演变已引入了AI代理作为各个行业的破坏性范式，但它们在机器翻译（MT）中的应用仍未得到充实。本文介绍并分析了MT单一和多代理系统的潜力，反映了它们如何增强多语言数字通信。虽然单一机构系统非常适合简单的翻译任务，但多代理系统涉及以结构化方式协作的多个专业的AI代理，可能为需要高精度，特定于领域的知识和上下文意识的复杂场景提供了有希望的解决方案。为了证明MT多代理工作流的可行性，我们正在法律MT进行一项试点研究。该研究采用了一个多代理系统，涉及四种专门的AI代理，用于（i）翻译，（ii）充分审查，（iii）流利度审查和（iv）最终编辑。我们的发现表明，多代理系统可能有可能显着提高领域适应性和上下文意识，并具有卓越的翻译质量，比传统的MT或单一代理系统。本文还为未来对MT多代理应用程序的研究，集成到专业翻译工作流程中的阶段奠定了基础，并分享了本文中分析的系统的演示。

Title: Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models

Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12898
Pdf URL: https://arxiv.org/pdf/2504.12898
Copy Paste: [[2504.12898]] Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models(https://arxiv.org/abs/2504.12898)
Keywords: language model, llm
Abstract: Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to poor generalizability. However, because dataset biases are diverse and bias suppression based on in-context learning is insufficient, the effectiveness of previous debiasing methods, both those based on prior knowledge and the automatic ones based on in-context learning, is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of the instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLMs and improve their generalizability across different tasks.
摘要：尽管取得了重大进展，但最近的研究表明，当前的大型语言模型（LLM）仍可能捕获数据集偏见并在推断过程中利用它们，从而导致LLM的普遍性差。但是，由于数据集偏见的多样性以及基于文本学习的偏见抑制性质不足，因此先前基于知识的偏见方法和基于秘密学习的自动偏见方法的有效性是有限的。为了应对这些挑战，我们探讨了因果机制与信息理论的结合，并提出了信息引导的因果干预措施（IGCIDB）框架。该框架首先利用信息引导的因果干预方法自动和自主平衡指令调整数据集的分布。随后，它采用标准监督的微调过程来培训DELIAS的LLM。实验结果表明，IGCIDB可以有效地提高Debias LLM，以提高其在不同任务中的普遍性。

Title: Benchmarking Multi-National Value Alignment for Large Language Models

Authors: Chengyi Ju, Weijie Shi, Chengzhong Liu, Jiaming Ji, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu, Yaodong Yang, Sirui Han, Yike Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12911
Pdf URL: https://arxiv.org/pdf/2504.12911
Copy Paste: [[2504.12911]] Benchmarking Multi-National Value Alignment for Large Language Models(https://arxiv.org/abs/2504.12911)
Keywords: language model, llm
Abstract: Do Large Language Models (LLMs) hold positions that conflict with your country's values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics, and a generation process with a Conflict Reduction mechanism to filter non-conflicting [...]. We conduct extensive experiments on various LLMs across countries, and the results provide insights that assist in identifying misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs' values with the target country.
摘要：大型语言模型（LLMS）是否拥有与您国家价值观冲突的立场？偶尔他们这样做！但是，现有的作品主要集中在道德评论上，未能捕捉到国家价值观的多样性，这些国家价值观包括更广泛的政策，法律和道德考虑。此外，使用手动设计问卷的频谱测试的当前基准不容易扩展。为了解决这些限制，我们介绍了Navab，这是一种综合基准，旨在评估LLM与五个主要国家的价值：中国，美国，英国，法国和德国的价值。 Navab实现了国家价值提取管道，以有效地构建价值评估数据集。具体来说，我们提出了一个具有指令标签的建模过程，以处理原始数据源，筛选过程，以过滤价值相关的主题以及具有减少冲突机制的生成过程，以过滤无冲突的这种HTTP URL对各个国家 /地区各种LLMS进行广泛实验，以及结果提供了洞察力，以帮助识别差异差异的情况。此外，我们证明了Navab可以与对齐技术结合使用，以通过使LLMS的价值与目标国家保持一致，从而有效地减少价值关注。

Title: MAIN: Mutual Alignment Is Necessary for instruction tuning

Authors: Fanyi Yang, Jianfeng Liu, Xin Zhang, Haoyu Liu, Xixin Cao, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12913
Pdf URL: https://arxiv.org/pdf/2504.12913
Copy Paste: [[2504.12913]] MAIN: Mutual Alignment Is Necessary for instruction tuning(https://arxiv.org/abs/2504.12913)
Keywords: language model, llm
Abstract: Instruction tuning has enabled large language models (LLMs) to achieve remarkable performance, but its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that high-quality instruction-response pairs are not defined by the individual quality of each component, but by the extent of their alignment with each other. To address this, we propose a Mutual Alignment Framework (MAIN) that ensures coherence between the instruction and response through mutual constraints. Experiments demonstrate that models such as LLaMA and Mistral, fine-tuned within this framework, outperform traditional methods across multiple benchmarks. This approach underscores the critical role of instruction-response alignment in enabling scalable and high-quality instruction tuning for LLMs.
摘要：指令调整使大型语言模型（LLM）能够实现出色的性能，但其成功在很大程度上取决于大规模，高质量的指令 - 响应对的可用性。但是，扩大数据生成的当前方法通常会忽略一个关键方面：指令和响应之间的对齐方式。我们假设高质量的指令 - 响应对不是由每个组件的个体质量而定义的，而是由它们相互对齐的程度。为了解决这个问题，我们提出了一个相互对准框架（主要），该框架通过相互约束确保指导和响应之间的连贯性。实验表明，在此框架内进行了微调，诸如Llama和Mistral之类的模型都超过了多个基准测试的传统方法。这种方法强调了指导 - 响应一致性在启用LLMS可扩展和高质量指令调整中的关键作用。

Title: ConExion: Concept Extraction with Large Language Models

Authors: Ebrahim Norouzi, Sven Hertling, Harald Sack
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.12915
Pdf URL: https://arxiv.org/pdf/2504.12915
Copy Paste: [[2504.12915]] ConExion: Concept Extraction with Large Language Models(https://arxiv.org/abs/2504.12915)
Keywords: language model, llm, prompt
Abstract: In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles the more challenging task of extracting all concepts present in the document that relate to the specific domain, not just the important ones. Through comprehensive evaluations on two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available.
摘要：在本文中，提出了一种使用预训练的大语言模型（LLM）从文档中提取概念的方法。与提取键形的传统方法相比，总结了文档中讨论的重要信息，我们的方法解决了提取与特定领域相关的所有当前概念的更具挑战性的任务，而不仅仅是重要的域。通过对两个广泛使用基准数据集的全面评估，我们证明了我们的方法与最新技术相比提高了F1分数。此外，我们探讨了在这些模型中使用提示进行无监督概念提取的潜力。提取的概念旨在支持本体论的领域覆盖范围评估并促进本体学习，从而强调了LLM在概念提取任务中的有效性。我们的源代码和数据集可在此HTTPS URL上公开可用。

Title: Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Authors: Nearchos Potamitis, Akhil Arora
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12951
Pdf URL: https://arxiv.org/pdf/2504.12951
Copy Paste: [[2504.12951]] Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback(https://arxiv.org/abs/2504.12951)
Keywords: language model, llm, prompt, agent
Abstract: Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?
摘要：大语言模型（LLM）的最新进步促进了通用自主剂的发展，在各个领域的复杂推理任务中表现出了显着的性能。这次激增刺激了许多迅速的推理框架的演变。最近的重点是迭代推理策略，该策略通过自我评估和口头反馈来完善产出。但是，这些策略需要额外的计算复杂性，以使模型能够识别和纠正其错误，从而大大增加成本。在这项工作中，我们介绍了``无反馈的重试''的概念，这是一种令人尴尬的简单而强大的机制，可通过允许LLMS重试解决错误的答案来提高推理框架。与常规的迭代改进方法不同，我们的方法不需要明确的自我反省或言语反馈，从而简化了改进过程。我们的发现表明，更简单的基于重试的方法通常比更复杂的推理框架的表现通常超过更复杂的推理框架，这表明复杂方法的好处可能并不总是证明其计算成本合理。通过挑战普遍的假设，即更复杂的推理策略固有地带来了更好的绩效，我们的工作为更简单，更有效的方法如何实现最佳结果提供了新的见解。那么，您需要重新审查吗？
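
The retrial mechanism described in the abstract above can be sketched as a plain retry loop: attempt the problem, check the answer, and try again from scratch if the check fails, without passing any critique or reflection text between attempts. This is only an illustration under stated assumptions, not the authors' implementation; `solve_with_retrials`, `attempt`, `is_correct`, and `max_trials` are hypothetical names, and the stub below stands in for a stochastic LLM call.

```python
from typing import Callable, Iterator, Optional, Tuple


def solve_with_retrials(
    attempt: Callable[[], str],
    is_correct: Callable[[str], bool],
    max_trials: int = 5,
) -> Tuple[Optional[str], int]:
    """Retry independent attempts until one passes the check or the budget runs out.

    Returns (answer, trials_used); answer is None if no attempt was accepted.
    No feedback from a failed attempt is given to the next one.
    """
    for trial in range(1, max_trials + 1):
        answer = attempt()
        if is_correct(answer):
            return answer, trial
    return None, max_trials


# Hypothetical stand-in for a sampled model response: a fixed sequence of answers.
_candidates: Iterator[str] = iter(["3", "5", "4"])
result = solve_with_retrials(lambda: next(_candidates), lambda a: a == "4")
print(result)  # ('4', 3): accepted on the third attempt
```

The only per-trial overhead in this sketch is the correctness check, which is what separates it from refinement loops that also generate and consume verbalized feedback.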

Title: Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization

Authors: Adithya Pratapa, Teruko Mitamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12972
Pdf URL: https://arxiv.org/pdf/2504.12972
Copy Paste: [[2504.12972]] Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization(https://arxiv.org/abs/2504.12972)
Keywords: language model, llm
Abstract: Recent advances in long-context reasoning abilities of language models led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. To this end, retrieval-augmented systems provide an efficient and effective alternative. However, their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with long-context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
摘要：语言模型的长篇文章推理能力的最新进展导致了大规模多文章摘要中有趣的应用。但是，先前的工作表明，这些长篇小说模型在其声称的上下文窗口中无效。为此，检索调制系统提供了一种有效有效的替代方案。但是，它们的性能可以对选择上下文长度的选择高度敏感。在这项工作中，我们提出了一种混合方法，该方法将检索功能的系统与近期语言模型支持的长篇小写窗口结合在一起。我们的方法首先估计了最佳检索长度，该长度是猎犬，摘要和数据集的函数。在数据集的随机采样子集中，我们使用一系列LLMS生成银色参考池。我们使用这些银引用来估计给定的抹布系统配置的最佳上下文长度。我们在多文件摘要任务上的结果展示了我们跨模型类和大小的方法的有效性。我们比较了强大的长篇小写基准（例如标尺和头盔）的长度估计值。我们的分析还强调了我们对非常长的LMS的估计方法的有效性及其对新类LMS的概括。

Title: Sparks of Science: Hypothesis Generation Using Structured Paper Data

Authors: Charles O'Neill, Tirthankar Ghosal, Roberta Răileanu, Mike Walmsley, Thang Bui, Kevin Schawinski, Ioana Ciucă
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.12976
Pdf URL: https://arxiv.org/pdf/2504.12976
Copy Paste: [[2504.12976]] Sparks of Science: Hypothesis Generation Using Structured Paper Data(https://arxiv.org/abs/2504.12976)
Keywords: language model, llm
Abstract: Generating novel and creative scientific hypotheses is a cornerstone in achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences and organized with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, with the model fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning (and where, at inference, we only provide the Bit), leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that by fine-tuning on our HypoGen dataset we improve the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available.
摘要：产生新颖和创造性的科学假设是实现人工通用智能的基石。大型语言和推理模型有可能协助对科学知情假设进行系统的创造，选择和验证。但是，当前的基础模型通常很难产生既新颖又可行的科学思想。原因之一是缺乏将科学假设产生（SHG）作为自然语言生成（NLG）任务的专用数据集。在本文中，我们介绍了hypogen，这是大约5500个结构性问题 - 类假设对的数据集，这些数据集是从带有位翼型带式架构构成的顶级计算机科学会议中提取的，其中位是常规的假设，火花是关键的洞察力或概念上的飞跃，翻转是由此产生的反应。 Hypogen独特地整合了明确的链条组件，该组件反映了从位到翻转的智力过程。我们证明，将框架假设产生作为有条件的语言建模，模型对位纤维带和反应链进行了微调（在推论时，我们只提供了位），从而改善了假设的整体质量。我们的评估采用自动指标和LLM法官排名来进行整体质量评估。我们表明，通过微调六型数据集，我们可以提高生成假设的新颖性，可行性和整体质量。该HTTP URL可公开使用Hypogen数据集。

Title: Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Reliable Response Generation in the Wild

Authors: Jiatai Wang, Zhiwei Xu, Di Jin, Xuewen Yang, Tao Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12982
Pdf URL: https://arxiv.org/pdf/2504.12982
Copy Paste: [[2504.12982]] Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Reliable Response Generation in the Wild(https://arxiv.org/abs/2504.12982)
Keywords: language model, llm, retrieval augmented generation
Abstract: The proliferation of large language models (LLMs) has significantly advanced information retrieval systems, particularly in response generation (RG). Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences. However, when the distinction is ambiguous, LLMs experience heightened uncertainty. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models into the adaptive augmentation of retrieved information and guides LLM preference in response generation. Extensive experiments on single-choice, open-ended question-answering (QA), and retrieval augmented generation (RAG) validate our theoretical findings and demonstrate the efficacy of Swin-VIB. Notably, our method improves single-choice task accuracy by at least 7.54% over competitive baselines.
摘要：大语言模型（LLM）的扩散具有明显的高级信息检索系统，尤其是在响应生成（RG）中。不幸的是，LLM经常面临内部记忆和检索外部信息之间的知识冲突，这是由于错误信息，偏见或过时的知识引起的。这些冲突破坏了响应的可靠性，并引入了决策中的不确定性。在这项工作中，我们分析了LLMS从信息理论的角度导航知识冲突，并透露，在冲突和补充信息的差异显着差异时，LLMS自信地解决了他们的偏好。但是，当区别模棱两可时，LLMS会经历不确定性的加剧。基于这种见解，我们提出了Swin-Vib，这是一个新颖的框架，将各种信息瓶颈模型的管道集成到自适应增强检索信息和指导LLM偏好中的响应生成中。关于单选，开放式问题避开（QA）和检索增强产生（RAG）的广泛实验验证了我们的理论发现，并证明了Swin-Vib的功效。值得注意的是，我们的方法将单选项任务准确性提高至少7.54 \％，而不是竞争基线。

Title: SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation

Authors: Saransh Agrawal, Kuan-Hao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12996
Pdf URL: https://arxiv.org/pdf/2504.12996
Copy Paste: [[2504.12996]] SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation(https://arxiv.org/abs/2504.12996)
Keywords: language model, llm
Abstract: Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0-5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers, simultaneously maximizing forget set loss via output token cross-entropy penalties and minimizing retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.
摘要：大型语言模型（LLM）经常在培训期间记住敏感信息，在部署公共访问模型时会带来风险。当前的机器未学习方法难以选择性地删除特定的数据关联而不会降低整体模型功能。本文介绍了我们针对目标划分的Semeval-2025任务4的解决方案，该任务4引入了两阶段方法，该方法将因果中介分析与层特异性优化结合在一起。通过在Olmo架构（1B和7B参数）上进行系统的因果追踪实验，我们确定了前几个变压器层（0-5层）在MLP模块中存储主题 - 归类关联中的关键作用。在这种见解的基础上，我们开发了一种约束优化方法，该方法在将新颖的关节损失函数应用于下层中，从而相同地通过输出令牌跨透镜惩罚来最大程度地最大化设置损失，并通过自适应正则化来最大程度地减少保留设置偏差。我们的方法在1B模型轨道中获得了第二名，表明了强大的任务性能，同时保持了88％的基线MMLU精度。这些结果建立了因果信息层优化，作为在LLM中有效，精确地学习的有希望的范式，在解决AI系统中的数据隐私问题方面迈出了重要一步。

Title: ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images

Authors: Sangwook Kim, Soonyoung Lee, Jongseong Jang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2504.13023
Pdf URL: https://arxiv.org/pdf/2504.13023
Copy Paste: [[2504.13023]] ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images(https://arxiv.org/abs/2504.13023)
Keywords: language model, llm, chat
Abstract: Recent studies have made significant progress in developing large language models (LLMs) in the medical domain, which can answer expert-level questions and demonstrate the potential to assist clinicians in real-world clinical scenarios. Studies have also shown the importance of integrating various modalities with existing LLMs for a better understanding of complex clinical contexts, which are inherently multi-faceted. Although studies have demonstrated the ability of multimodal LLMs (MLLMs) in histopathology to answer questions about given images, these models lack a thorough understanding of clinical context because they rely on patch-level data with limited information from public datasets. Thus, developing MLLMs that operate at the level of whole slide images (WSIs) is significant in terms of the scalability and applicability of MLLMs in histopathology. In this study, we introduce an expert-level MLLM for histopathology using WSIs, dubbed ChatEXAONEPath. We present a retrieval-based data generation pipeline using 10,094 pairs of WSIs and histopathology reports from The Cancer Genome Atlas (TCGA). We also showcase an AI-based evaluation protocol for a comprehensive understanding of the medical context from given multimodal information and evaluate generated answers compared to the original histopathology reports. We demonstrate the ability to diagnose the given histopathology images using ChatEXAONEPath, with an acceptance rate of 62.9% from 1,134 pairs of WSIs and reports. Our proposed model can understand pan-cancer WSIs and clinical context from various cancer types. We argue that our proposed model has the potential to assist clinicians by comprehensively understanding the complex morphology of WSIs for cancer diagnosis through the integration of multiple modalities.
摘要：最近的研究在医疗领域开发大型语言模型（LLM）方面取得了重大进展，该模型可以回答专家级问题，并证明有可能在现实世界中临床方案协助临床医生。研究还见证了将各种方式与现有LLM融合在一起的重要性，以更好地理解复杂的临床环境，而复杂的临床环境本质上是多方面的。尽管研究表明，多模式LLM在组织病理学中回答给定图像的问题的能力，但由于贴片级数据，他们缺乏对透彻临床环境的理解，而公共数据集的信息有限。因此，开发WSI级别的MLLM在组织病理学中MLLM的可伸缩性和适用性方面非常重要。在这项研究中，我们使用WSIS介绍了用于组织病理学的专家级MLLM，称为ChatexaonePath。我们使用10,094对WSIS和癌症基因组图集（TCGA）的组织病理学报告（TCGA）提出了基于检索的数据生成管道。我们还展示了一种基于AI的评估协议，以全面了解给定的多模式信息，并与原始的组织病理学报告相比评估生成的答案。我们证明了使用ChatexaonePath诊断给定的组织病理学图像的能力，从1,134对WSI和报告中的接受率为62.9％。我们提出的模型可以了解各种癌症类型的Pan-Cancer WSI和临床环境。我们认为，我们提出的模型有可能通过通过整合多种方式来全面理解WSI的复杂形态来协助临床医生。

Title: Aspect-Based Summarization with Self-Aspect Retrieval Enhanced Generation

Authors: Yichao Feng, Shuai Zhao, Yueqiu Li, Luwei Xiao, Xiaobao Wu, Anh Tuan Luu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13054
Pdf URL: https://arxiv.org/pdf/2504.13054
Copy Paste: [[2504.13054]] Aspect-Based Summarization with Self-Aspect Retrieval Enhanced Generation(https://arxiv.org/abs/2504.13054)
Keywords: language model, hallucination, prompt
Abstract: Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely excessively on prompt engineering and face token limits and hallucination challenges, especially with in-context learning. To address these challenges, in this paper, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify its relevant text segments. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the challenge of token limits. Moreover, our framework optimizes token usage by deleting unrelated parts of the text and ensuring that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token limitation problem.
摘要：基于方面的摘要旨在生成针对特定方面的摘要，以解决传统摘要方法的资源限制和有限的概括性。最近，大型语言模型在此任务中表现出了希望，而无需培训。但是，他们过分依赖迅速的工程和面向令牌的限制和幻觉挑战，尤其是在文化学习中。为了解决这些挑战，在本文中，我们为基于方面的摘要提出了一个新的框架：自觉检索增强的摘要生成。我们不仅要依赖于文本学习，而是采用嵌入式驱动的检索机制来识别其相关文本段。这种方法可以提取相关内容的同时避免不必要的细节，从而减轻令牌限制的挑战。此外，我们的框架通过删除文本的无关部分并确保模型根据给定方面严格生成输出来优化令牌用法。通过在基准数据集上进行广泛的实验，我们证明了我们的框架不仅可以实现卓越的性能，而且可以有效地减轻令牌限制问题。

Title: Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

Authors: Sudesh Ramesh Bhagat, Ibne Farabi Shihab, Anuj Sharma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13068
Pdf URL: https://arxiv.org/pdf/2504.13068
Copy Paste: [[2504.13068]] Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models(https://arxiv.org/abs/2504.13068)
Keywords: language model, gpt, llm
Abstract: This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
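Summary: This study examines the relationship between the accuracy of deep learning (DL) models and their agreement with experts when classifying crash narratives. Five DL models, including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier, are evaluated against expert-labeled data and narrative text. The analysis is then extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. The results show a counterintuitive trend: models with higher technical accuracy often agree less with domain experts, whereas LLMs align more closely with experts despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, the authors use Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. The findings indicate that expert-aligned models rely more on contextual and temporal language cues than on location-specific keywords. Accuracy alone is therefore insufficient for evaluating models in safety-critical NLP applications. The authors advocate including expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.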
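
The gap between accuracy and chance-corrected agreement mentioned in this abstract can be illustrated with a minimal sketch of Cohen's Kappa. The label lists below are hypothetical and are not taken from the paper; the function only implements the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treated here as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary labels (1 = intersection-related crash, 0 = not).
expert_labels = [1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 1, 0, 0, 1, 0, 0, 1]

raw_agreement = sum(e == m for e, m in zip(expert_labels, model_labels)) / len(expert_labels)
kappa = cohens_kappa(expert_labels, model_labels)
print(raw_agreement, kappa)  # 0.75 0.5
```

In this toy case the model matches the expert on 75% of items, yet kappa is only 0.5 because half of that agreement would be expected by chance given the balanced label frequencies.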

Title: Retrieval-Augmented Generation with Conflicting Evidence

Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13079
Pdf URL: https://arxiv.org/pdf/2504.13079
Copy Paste: [[2504.13079]] Retrieval-Augmented Generation with Conflicting Evidence(https://arxiv.org/abs/2504.13079)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.
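Summary: Large language model (LLM) agents increasingly use retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, however, these systems often have to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied these challenges in isolation, considering one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. The authors instead consider multiple factors at once and propose (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. Using both closed and open-source models, MADAM-RAG improves over strong RAG baselines by up to 11.40% on AmbigDocs, which requires presenting all valid answers for ambiguous queries, and by up to 15.80% (absolute) with Llama3.3-70B-Instruct on FaithEval, which requires suppressing misinformation. RAMDocs also proves challenging for existing RAG baselines (Llama3.3-70B-Instruct obtains an exact match score of only 32.60). Although MADAM-RAG begins to address these conflicting factors, the analysis indicates that a substantial gap remains, especially as the imbalance between supporting evidence and misinformation increases.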

Title: LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard

Authors: Varun Rao, Youran Sun, Mahendra Kumar, Tejas Mutneja, Agastya Mukherjee, Haizhao Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13125
Pdf URL: https://arxiv.org/pdf/2504.13125
Copy Paste: [[2504.13125]] LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard(https://arxiv.org/abs/2504.13125)
Keywords: language model, llm
Abstract: This paper investigates the application of large language models (LLMs) to financial tasks. We fine-tuned foundation models using the Open FinLLM Leaderboard as a benchmark. Building on Qwen2.5 and Deepseek-R1, we employed techniques including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) to enhance their financial capabilities. The fine-tuned models demonstrated substantial performance gains across a wide range of financial tasks. Moreover, we measured the data scaling law in the financial domain. Our work demonstrates the potential of large language models (LLMs) in financial applications.
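Summary: This paper investigates the application of large language models (LLMs) to financial tasks. The authors fine-tune foundation models using the Open FinLLM Leaderboard as a benchmark. Building on Qwen2.5 and Deepseek-R1, they apply techniques including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) to enhance the models' financial capabilities. The fine-tuned models show substantial performance gains across a wide range of financial tasks. The authors also measure the data scaling law in the financial domain. The work demonstrates the potential of LLMs in financial applications.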
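
For readers unfamiliar with DPO, one of the techniques named in this abstract, the following is a minimal sketch of the standard per-example DPO loss. The abstract gives no training details, so the log-probability values and the `beta` setting below are hypothetical and are not the authors' configuration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one preference pair.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy identical to the reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy favors the chosen answer more
print(neutral, improved)
```

When the policy equals the reference model, the loss is log 2; it falls below that value as the policy raises the chosen response's likelihood relative to the rejected one.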

Title: Energy-Based Reward Models for Robust Language Model Alignment

Authors: Anamika Lochab, Ruqi Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13134
Pdf URL: https://arxiv.org/pdf/2504.13134
Copy Paste: [[2504.13134]] Energy-Based Reward Models for Robust Language Model Alignment(https://arxiv.org/abs/2504.13134)
Keywords: language model, llm
Abstract: Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.
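Summary: Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences, but they often struggle to capture complex human preferences and to generalize to unseen data. To address these challenges, the authors introduce the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that improves RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It does so through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, which makes it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks show significant improvements in both robustness and generalization, with up to a 5.97% improvement on safety-critical alignment tasks compared with standard RMs. Reinforcement learning experiments further confirm that the refined rewards improve alignment quality and effectively delay reward hacking. These results present the approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.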

Title: Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

Authors: João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterel, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, Timothy J. O'Donnell
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13139
Pdf URL: https://arxiv.org/pdf/2504.13139
Copy Paste: [[2504.13139]] Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo(https://arxiv.org/abs/2504.13139)
Keywords: language model
Abstract: A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution -- which can differ substantially from the LM's base distribution -- is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains -- Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis -- we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
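Summary: A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be framed naturally as probabilistic conditioning, but exact generation from the resulting distribution, which can differ substantially from the LM's base distribution, is generally intractable. The authors develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). The SMC framework makes it possible to incorporate domain- and problem-specific constraints flexibly at inference time and to reallocate computational resources efficiently in light of new information during generation. Comparing against a number of alternatives and ablations on four challenging domains (Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis), the authors show that, with little overhead, the approach allows small open-source language models to outperform models over 8x larger as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, they show that these performance improvements are driven by better approximation to the posterior distribution. The system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.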
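
The general idea of SMC-based constrained generation can be sketched with a toy example. This is not the authors' system, which builds on the framework of Lew et al. (2023). Here `toy_lm_probs` is a hypothetical stand-in for a language model, and the constraint (no two consecutive "b" tokens) is chosen only for illustration. Each particle is extended with a token proposed by the base model, reweighted by the constraint, and the population is resampled when the effective sample size drops.

```python
import random


def toy_lm_probs(prefix):
    # Hypothetical stand-in for an LM: a fixed next-token distribution.
    return {"a": 0.3, "b": 0.5, "<eos>": 0.2}


def potential(prefix):
    # Constraint as a 0/1 potential: 0.0 once "b" appears twice in a row.
    violated = any(x == "b" and y == "b" for x, y in zip(prefix, prefix[1:]))
    return 0.0 if violated else 1.0


def smc_generate(num_particles=200, max_len=8, seed=0):
    rng = random.Random(seed)
    particles = [[] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(max_len):
        for i, seq in enumerate(particles):
            if seq and seq[-1] == "<eos>":
                continue  # finished particles are left unchanged
            probs = toy_lm_probs(seq)
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
            seq.append(token)
            # The proposal is the base LM, so the incremental weight is the
            # potential itself (for a 0/1 potential, a violated prefix stays at 0).
            weights[i] *= potential(seq)
        total = sum(weights)
        if total == 0.0:
            raise RuntimeError("all particles violated the constraint")
        # Effective sample size; resample when it falls below half the population.
        ess = total ** 2 / sum(w * w for w in weights)
        if ess < num_particles / 2:
            idx = rng.choices(range(num_particles), weights=weights, k=num_particles)
            particles = [list(particles[j]) for j in idx]
            weights = [total / num_particles] * num_particles
        if all(seq[-1] == "<eos>" for seq in particles):
            break
    return particles, weights


particles, weights = smc_generate()
survivors = [seq for seq, w in zip(particles, weights) if w > 0]
print(len(survivors), survivors[0])
```

Resampling moves computation away from particles that have already violated the constraint and toward promising prefixes, which is the reallocation of resources the abstract refers to. A real system would replace the toy distribution with LM logits and could use richer, non-binary potentials.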

Title: CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Authors: Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan (Celine) Lin, Jan Kautz, Pavlo Molchanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13161
Pdf URL: https://arxiv.org/pdf/2504.13161
Copy Paste: [[2504.13161]] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training(https://arxiv.org/abs/2504.13161)
Keywords: language model
Abstract: Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: this https URL
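Summary: Pre-training datasets are typically collected from web content and lack inherent domain divisions. For example, widely used datasets such as Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. As a result, identifying an optimal pre-training data mixture remains a challenging problem despite its significant benefits for pre-training performance. To address this, the authors propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, the 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, the authors introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters intended as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. They analyze the final data mixture and describe the characteristics of an optimal data mixture. Our data is available at: this https URL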