2025-04-11

Title: EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

Authors: Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, Sean O'Brien
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07100
Pdf URL: https://arxiv.org/pdf/2504.07100
Copy Paste: [[2504.07100]] EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models(https://arxiv.org/abs/2504.07100)
Keywords: language model, llm, prompt
Abstract: The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates five widely-used large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compare these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities - models consistently underperform on dialectal inputs compared to Standard American English. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
摘要：人类语言的多样性受社会，文化和地区影响的影响，对自然语言处理（NLP）系统提出了重大挑战。现有的基准通常会忽略语言内语言的变化，从而使非标准方言的扬声器服务不足。为了解决这一差距，我们介绍了Endive（英语多样性），这是一个基准，该基准在语言理解，算法推理，数学和逻辑方面评估了五个广泛使用的大型语言模型（LLM）。我们的框架将标准的美国英语数据集转化为五个代表性不足的方言，并使用以母语者的验证示例进行了验证，并通过流畅的评估，偏好测试和语义相似性指标将这些翻译与基于规则的方法进行比较。人类评估证实了高翻译质量，忠诚，流利性和形式的平均得分至少为6.02/7。通过滤除近乎相同的翻译，我们创建了一个具有挑战性的数据集，该数据集揭示了重大的性能差异 - 与标准的美国英语相比，方言输入的模型始终表现不佳。因此，通过发现模型偏见并促进更公平的语言技术，可以推进方言感知的NLP。

Title: How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities

Authors: Aly M. Kassem, Bernhard Schölkopf, Zhijing Jin
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2504.07113
Pdf URL: https://arxiv.org/pdf/2504.07113
Copy Paste: [[2504.07113]] How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities(https://arxiv.org/abs/2504.07113)
Keywords: language model, llm
Abstract: Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance by dynamically assigning queries to the most appropriate model based on query complexity. Despite recent advances showing that preference-data-based routers can outperform traditional methods, current evaluation benchmarks remain limited. They largely focus on general model capabilities while overlooking task-specific behaviors and critical concerns such as privacy, safety, and potential backdoor vulnerabilities introduced through preference data. In response, we propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types, including coding, translation, mathematics, human instructions, general knowledge, and LLM jailbreaking. Additionally, it integrates privacy and safety assessments to reveal hidden risks. Our experiments on three preference-based routers and two commercial counterparts demonstrate that while these systems improve efficiency, they often make suboptimal, category-driven decisions. For instance, a BERT-based router directs all coding and mathematics queries to the most powerful LLM even when simpler models would suffice, while routing jailbreaking attempts to weaker models, thereby elevating safety risks.
摘要：大型语言模型（LLM）路由已成为一种至关重要的策略，可以通过基于查询复杂性将查询动态分配给最合适的模型来平衡计算成本与性能。尽管最近的进展表明，基于偏好的路由器的表现可以胜过传统方法，但当前的评估基准仍有限。他们在很大程度上专注于通用模型能力，同时忽略了特定于任务的行为以及通过偏好数据引入的隐私，安全性和潜在的后门漏洞等关键问题。作为回应，我们提出了DSC基准：多样，简单和分类，这是一个评估框架，将路由器性能分类为各种查询类型，包括编码，翻译，数学，数学，人类的指导，常识和LLM越狱。此外，它集成了隐私和安全评估以揭示隐藏的风险。我们对三个基于偏好的路由器和两个商业对应物进行的实验表明，尽管这些系统提高了效率，但它们通常会做出次优的类别驱动的决策。例如，基于BERT的路由器将所有编码和数学查询都引导到最强大的LLM，即使更简单的模型就足够，同时越狱尝试较弱的模型，从而提高了安全风险。

Title: ChatBench: From Static Benchmarks to Human-AI Evaluation

Authors: Serina Chang, Ashton Anderson, Jake M. Hofman
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2504.07114
Pdf URL: https://arxiv.org/pdf/2504.07114
Copy Paste: [[2504.07114]] ChatBench: From Static Benchmarks to Human-AI Evaluation(https://arxiv.org/abs/2504.07114)
Keywords: llm, chat
Abstract: With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., "AI-alone"). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.
摘要：随着基于LLM的聊天机器人的迅速采用，迫切需要评估人类和LLM可以共同实现的目标。但是，标准的基准（例如MMLU）孤立地测量LLM功能（即“ Ai-horone”）。在这里，我们通过向用户播种问题并让他们与LLM进行对话以回答他们的问题，从而设计和进行用户研究，以将MMLU问题转换为用户对话。我们发布ChatBench，这是一个新的数据集，具有AI-OLONE，单词和用户-AI数据，用于396个问题和两个LLM，包括144K Answers和7,336个用户-AI对话。我们发现AI-ORONOME精度无法预测用户AI的准确性，并且在多个受试者（数学，物理和道德推理）之间存在显着差异，并且我们分析了用户-AI对话，以提供有关它们与AI-Olone基准分歧的见解。最后，我们表明，在Chatbench子集中对用户模拟器进行微型模拟器提高了其估计用户-AI精度的能力，从而将持有问题的相关性提高了20点以上，从而为扩展交互式评估提供了可能性。

Title: CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning

Authors: Andrew Rufail, Daniel Kim, Sean O'Brien, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07116
Pdf URL: https://arxiv.org/pdf/2504.07116
Copy Paste: [[2504.07116]] CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning(https://arxiv.org/abs/2504.07116)
Keywords: language model
Abstract: We introduce CLEAR (Contrasting Textual Feedback with Experts and Amateurs for Reasoning), a novel approach to language model reasoning that leverages the strengths of a larger (expert) model and smaller (amateur) model. The expert and amateur models each provide feedback on a model's initial output and are contrasted with each other into refined feedback. This feedback is subsequently applied to iteratively improve CLEAR's responses. Our experiments demonstrate that CLEAR outperforms state-of-the-art methods in several challenging reasoning tasks, including story outline improvement (up to 19.6% relative increase in interestingness), constrained generation (up to 18.5% increase in coverage), mathematical reasoning (up to 6.7% improvement in accuracy) and mitigation of toxicity (decrease of up to 22% in toxicity).
摘要：我们介绍了清晰的（将文本反馈与专家和业余爱好者进行推理的对比），这是一种新型的语言模型推理方法，利用了较大（专家）模型和较小（业余）模型的优势。专家和业余模型各自提供有关模型初始输出的反馈，并将彼此形成鲜明对比。随后将此反馈应用于迭代改善Clear的响应。我们的实验表明，在几项具有挑战性的推理任务中，明确的实验胜过最先进的方法，包括故事大纲的改进（兴趣相对相对增加），发电约为18.5％（覆盖范围提高了18.5％），数学推理，数学推理（准确性高达6.7％）和毒性降低毒性（高达6.7％）和毒性的降低（降低毒性降低22％）。

Title: DeepSeek-R1 Thoughtology: Let's about LLM Reasoning

Authors: Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07128
Pdf URL: https://arxiv.org/pdf/2504.07128
Copy Paste: [[2504.07128]] DeepSeek-R1 Thoughtology: Let's about LLM Reasoning(https://arxiv.org/abs/2504.07128)
Keywords: llm
Abstract: Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
摘要：诸如DeepSeek-R1之类的大型推理模型标志着LLM如何处理复杂问题的基本转变。 DeepSeek-R1没有直接为给定输入提供答案，而是创建详细的多步推理链，在提供答案之前似乎对问题“思考”。用户公开使用此推理过程，为研究模型的推理行为和开放思想学领域创造了无尽的机会。从DeepSeek-R1的基本推理基本构建基础的分类学开始，我们对DeepSeek-R1的分析研究了思想长度的影响和可控性，长期或令人困惑的环境，文化和安全问题的管理以及DeepSeek-R1相对于类似人类的语言处理和世界模型的DeepSeek-R1认知现象的状态。我们的发现描绘了细微的图片。值得注意的是，我们表明DeepSeek-R1具有“最佳选择”推理，额外的推理时间可以损害模型性能。此外，我们发现DeepSeek-R1持续反思以前探讨的问题制剂，阻碍进一步探索的趋势。我们还注意到，与其非争议的同行相比，DeepSeek-R1的强烈安全漏洞，这也可能损害与安全一致的LLM。

Title: HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Authors: Mingxuan Li, Hanchen Li, Chenhao Tan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.07174
Pdf URL: https://arxiv.org/pdf/2504.07174
Copy Paste: [[2504.07174]] HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation(https://arxiv.org/abs/2504.07174)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine LLM's assigned scores on each decomposed dimension to acquire overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
摘要：大型语言模型（LLMS）表现出了自动化自然语言产生的巨大潜力。 LLM-AS-A-a-gudge的先前框架以两种方式下降了：他们要么使用零拍设置而无需咨询任何人类输入，这会导致较低的对齐方式，或者在标记的数据上进行了微调LLM，这需要非平凡的样品。此外，以前的方法通常在自动化评估背后几乎没有推理。在本文中，我们提出了假设的假设引导的评估框架，该框架首先使用一小部分人类评估来为人类判断生成更详细的专栏，然后将类似清单的清单样方法结合在一起，以结合LLM对每个分解的维度分配的分数，以获取整体分数。只有30次人力评估，低词与人类排名（Spearman相关性）和人类得分（Pearson相关性）的一致性达到最先进的表现，平均表现G-eval的G-eval的G-eval比11.86％和微调的Llama-3.1-8B结构至少比至少3倍的人为人为评估，以占11.95％。此外，我们进行系统的研究以评估降压管的鲁棒性，从而强调了其有效性作为可靠且可解释的自动化评估框架。

Title: SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog

Authors: Jennifer D'Souza, Sameer Sadruddin, Holger Israel, Mathias Begoin, Diana Slawig
Subjects: cs.CL, cs.AI, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.07199
Pdf URL: https://arxiv.org/pdf/2504.07199
Copy Paste: [[2504.07199]] SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog(https://arxiv.org/abs/2504.07199)
Keywords: llm
Abstract: We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into applying LLMs for digital library classification.
摘要：我们介绍了Semeval-2025任务5：LLMS4Subjects，这是使用GND分类法的科学和技术记录的自动化主题标记的共享任务。参与者开发了基于LLM的系统来推荐TOP-K受试者，并通过主题专家通过定量指标（精度，召回，F1得分）和定性评估进行评估。结果突出了LLM合奏，合成数据生成和多语言处理的有效性，从而为应用LLMS用于数字图书馆分类提供了见解。

Title: ConceptCarve: Dynamic Realization of Evidence

Authors: Eylon Caplan, Dan Goldwasser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07228
Pdf URL: https://arxiv.org/pdf/2504.07228
Copy Paste: [[2504.07228]] ConceptCarve: Dynamic Realization of Evidence(https://arxiv.org/abs/2504.07228)
Keywords: llm
Abstract: Finding evidence for human opinion and behavior at scale is a challenging task, often requiring an understanding of sophisticated thought patterns among vast online communities found on social media. For example, studying how gun ownership is related to the perception of Freedom, requires a retrieval system that can operate at scale over social media posts, while dealing with two key challenges: (1) identifying abstract concept instances, (2) which can be instantiated differently across different communities. To address these, we introduce ConceptCarve, an evidence retrieval framework that utilizes traditional retrievers and LLMs to dynamically characterize the search space during retrieval. Our experiments show that ConceptCarve surpasses traditional retrieval systems in finding evidence within a social media community. It also produces an interpretable representation of the evidence for that community, which we use to qualitatively analyze complex thought patterns that manifest differently across the communities.
摘要：在大规模上找到人类意见和行为的证据是一项具有挑战性的任务，通常需要了解社交媒体上广阔的在线社区中的复杂思维模式。例如，研究枪支所有权与对自由的看法如何相关，需要一个可以通过社交媒体帖子进行大规模运行的检索系统，同时应对两个关键挑战：（1）确定抽象概念实例，（2）可以在不同社区之间实例化。为了解决这些问题，我们介绍了ConceptCarve，这是一个证据检索框架，该框架利用传统的猎犬和LLMS在检索过程中动态表征搜索空间。我们的实验表明，ConceptCarve超过了传统的检索系统，可以在社交媒体社区中找到证据。它还为该社区的证据提供了可解释的代表，我们用来分析在整个社区中表现出不同不同的复杂思维模式。

Title: Language Modeling for the Future of Finance: A Quantitative Survey into Metrics, Tasks, and Data Opportunities

Authors: Nikita Tatarinov, Siddhant Sukhani, Agam Shah, Sudheer Chava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07274
Pdf URL: https://arxiv.org/pdf/2504.07274
Copy Paste: [[2504.07274]] Language Modeling for the Future of Finance: A Quantitative Survey into Metrics, Tasks, and Data Opportunities(https://arxiv.org/abs/2504.07274)
Keywords: language model
Abstract: Recent advances in language modeling have led to growing interest in applying Natural Language Processing (NLP) techniques to financial problems, enabling new approaches to analysis and decision-making. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 qualitative and quantitative dimensions, identifying key trends such as the increasing use of general-purpose language models, steady progress in sentiment analysis and information extraction, and emerging efforts around explainability and privacy-preserving methods. We also discuss the use of evaluation metrics, highlighting the importance of domain-specific ones to complement standard machine learning metrics. Our findings emphasize the need for more accessible, adaptive datasets and highlight the significance of incorporating financial crisis periods to strengthen model robustness under real-world conditions. This survey provides a structured overview of NLP research applied to finance and offers practical insights for researchers and practitioners working at this intersection.
摘要：语言建模的最新进展导致对将自然语言处理（NLP）技术应用于财务问题的兴趣日益增加，从而实现了分析和决策的新方法。为了系统地检查这一趋势，我们回顾了38个会议和讲习班之间在2017年至2024年之间发表的374个NLP研究论文，并对221篇论文进行了重点分析，这些论文直接解决了与财务相关的任务。我们在11个定性和定量维度上评估了这些论文，并确定了关键趋势，例如通用语言模型的使用日益增加，情感分析和信息提取中的稳定进步以及围绕解释性和隐私保护方法的新兴努力。我们还讨论了评估指标的使用，强调了特定于域的标准机器学习指标的重要性。我们的发现强调了需要更容易访问，自适应数据集的必要性，并强调了将金融危机期纳入在现实世界条件下增强模型鲁棒性的重要性。这项调查提供了适用于金融的NLP研究的结构化概述，并为在此交叉路口工作的研究人员和从业人员提供了实用的见解。

Title: RAISE: Reinforenced Adaptive Instruction Selection For Large Language Models

Authors: Lv Qingsong, Yangning Li, Zihua Lan, Zishan Xu, Jiwei Tang, Yinghui Li, Wenhao Jiang, Hai-Tao Zheng, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07282
Pdf URL: https://arxiv.org/pdf/2504.07282
Copy Paste: [[2504.07282]] RAISE: Reinforenced Adaptive Instruction Selection For Large Language Models(https://arxiv.org/abs/2504.07282)
Keywords: language model, llm
Abstract: In the instruction fine-tuning of large language models (LLMs), it has become a consensus that a few high-quality instructions are superior to a large number of low-quality instructions. At present, many instruction selection methods have been proposed, but most of these methods select instruction based on heuristic quality metrics, and only consider data selection before training. These designs lead to insufficient optimization of instruction fine-tuning, and fixed heuristic indicators are often difficult to optimize for specific tasks. So we designed a dynamic, task-objective-driven instruction selection framework RAISE(Reinforenced Adaptive Instruction SElection), which incorporates the entire instruction fine-tuning process into optimization, selecting instruction at each step based on the expected impact of instruction on model performance improvement. Our approach is well interpretable and has strong task-specific optimization capabilities. By modeling dynamic instruction selection as a sequential decision-making process, we use RL to train our selection strategy. Extensive experiments and result analysis prove the superiority of our method compared with other instruction selection methods. Notably, RAISE achieves superior performance by updating only 1\% of the training steps compared to full-data training, demonstrating its efficiency and effectiveness.
摘要：在大型语言模型（LLMS）的指导微调中，已经达成共识，即一些高质量的说明优于大量低质量指令。目前，已经提出了许多指导选择方法，但是这些方法中的大多数基于启发式质量指标选择了指令，并且仅在培训前考虑数据选择。这些设计导致教学微调的优化不足，固定的启发式指标通常很难针对特定任务进行优化。因此，我们设计了一个动态的，任务 - 目标驱动的指令选择框架提高（增强自适应指令选择），该框架将整个指令微调过程纳入优化，并根据指令对模型性能改进的预期影响在每个步骤中选择指令。我们的方法是易于解释的，并且具有特定于任务的优化功能。通过将动态指导选择建模为顺序决策过程，我们使用RL来训练我们的选择策略。与其他指导选择方法相比，广泛的实验和结果分析证明了我们方法的优越性。值得注意的是，与全DATA培训相比，RAIS仅通过更新1 \％的培训步骤来实现卓越的性能，从而证明其效率和有效性。

Title: MDIT: A Model-free Data Interpolation Method for Diverse Instruction Tuning

Authors: Yangning Li, Zihua Lan, Lv Qingsong, Yinghui Li, Hai-Tao Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07288
Pdf URL: https://arxiv.org/pdf/2504.07288
Copy Paste: [[2504.07288]] MDIT: A Model-free Data Interpolation Method for Diverse Instruction Tuning(https://arxiv.org/abs/2504.07288)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) are increasingly applied across various tasks, instruction tuning has emerged as a critical method for enhancing model performance. However, current data management strategies face substantial challenges in generating diverse and comprehensive data, restricting further improvements in model performance. To address this gap, we propose MDIT, a novel model-free data interpolation method for diverse instruction tuning, which generates varied and high-quality instruction data by performing task interpolation. Moreover, it contains diversity-based clustering strategies to ensure the diversity of the training data. Extensive experiments show that our method achieves superior performance in multiple benchmark tasks. The LLMs finetuned with MDIT show significant improvements in numerous tasks such as general question answering, math reasoning, and code generation. MDIT offers an efficient and automatic data synthetic method, generating diverse instruction data without depending on external resources while expanding the application potential of LLMs in complex environments.
摘要：随着大型语言模型（LLM）越来越多地在各种任务中应用，指令调整已成为增强模型性能的关键方法。但是，当前的数据管理策略在产生多样化和全面的数据方面面临着重大挑战，从而限制了模型性能的进一步改进。为了解决这一差距，我们提出了MDIT，这是一种用于不同指令调整的新型无模型数据插值方法，该方法通过执行任务插值来生成多样化和高质量的指令数据。此外，它包含基于多样性的聚类策略，以确保培训数据的多样性。广泛的实验表明，我们的方法在多个基准任务中实现了卓越的性能。使用MDIT进行的LLMS对LLMS进行了重大改进，例如一般问答，数学推理和代码生成等许多任务。 MDIT提供了一种高效且自动的数据合成方法，生成了不同的指令数据，而无需依赖外部资源，同时扩大了LLM在复杂环境中的应用潜力。

Title: PAYADOR: A Minimalist Approach to Grounding Language Models on Structured Data for Interactive Storytelling and Role-playing Games

Authors: Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07304
Pdf URL: https://arxiv.org/pdf/2504.07304
Copy Paste: [[2504.07304]] PAYADOR: A Minimalist Approach to Grounding Language Models on Structured Data for Interactive Storytelling and Role-playing Games(https://arxiv.org/abs/2504.07304)
Keywords: language model
Abstract: Every time an Interactive Storytelling (IS) system gets a player input, it is facing the world-update problem. Classical approaches to this problem consist in mapping that input to known preprogrammed actions, what can severely constrain the free will of the player. When the expected experience has a strong focus on improvisation, like in Role-playing Games (RPGs), this problem is critical. In this paper we present PAYADOR, a different approach that focuses on predicting the outcomes of the actions instead of representing the actions themselves. To implement this approach, we ground a Large Language Model to a minimal representation of the fictional world, obtaining promising results. We make this contribution open-source, so it can be adapted and used for other related research on unleashing the co-creativity power of RPGs.
摘要：每当互动讲故事（IS）系统获得播放器的输入时，它就会面临世界范围的问题。该问题的经典方法包括将输入映射到已知的预编程动作，什么可以严重限制玩家的自由意志。当预期的体验重点放在即兴演奏上，例如角色扮演游戏（RPG），此问题至关重要。在本文中，我们介绍了Payador，另一种侧重于预测行动结果而不是代表行动本身的方法。为了实施这种方法，我们将大型语言模型扎根，以最小化虚构的世界，从而获得了有希望的结果。我们为开源做出了这种贡献，因此可以对其进行调整并用于释放RPG的共同创造力的其他相关研究。

Title: Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization

Authors: Shujin Wu, Cheng Qian, Yi R. (May)Fung, Paul Pu Liang, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07316
Pdf URL: https://arxiv.org/pdf/2504.07316
Copy Paste: [[2504.07316]] Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization(https://arxiv.org/abs/2504.07316)
Keywords: language model, llm
Abstract: The growing capabilities of large language models (LLMs) present a key challenge of maintaining effective human oversight. Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable LLMs using weaker ones. Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. This hinders students from employing their knowledge during training and reaching their full potential. In this work, we introduce Alice (pro{A}ctive {l}earning w{i}th tea{c}her's D{e}monstrations), a framework that leverages complementary knowledge between teacher and student to enhance the learning this http URL probe the knowledge base of the teacher model by eliciting their uncertainty, and then use these insights together with teachers' responses as demonstrations to guide student models in self-generating improved responses for supervision. In addition, for situations with significant capability gaps between teacher and student models, we introduce cascade Alice, which employs a hierarchical training approach where weak teachers initially supervise intermediate models, who then guide stronger models in sequence. Experimental results demonstrate that our method significantly enhances the W2SG performance, yielding substantial improvements in three key tasks compared to the original W2SG: knowledge-based reasoning (+4.0%), mathematical reasoning (+22.62%), and logical reasoning (+12.11%). This highlights the effectiveness of our new W2SG paradigm that enables more robust knowledge transfer and supervision outcome.
摘要：大型语言模型（LLM）的越来越多的能力提出了维持有效人类监督的关键挑战。弱到紧密的概括（W2SG）为使用较弱的LLM进行了有前途的框架。传统的W2SG方法依赖于被动学习，在该学习中，一个弱老师提供嘈杂的演示来培训强大的学生。这阻碍了学生在培训期间运用知识并发挥全部潜力。在这项工作中，我们介绍了爱丽丝（pro {a} ctive {l}赚取w {i} the {i} th th th th th th th th th th th her d {e} monstrations，一个框架，该框架是教师和学生之间的互补知识来利用教师和学生之间的互补知识，以增强学习http url的模型，然后将他们的模型与他们的知识相同，然后用教师的模型来指导他们的知识基础，并以识别为指导。在自我生成的改进反应中，以进行监督。此外，对于在教师和学生模型之间具有巨大能力差距的情况，我们介绍了喀斯喀特爱丽丝（Alice），该级联采用了一种等级培训方法，其中弱教师最初监督中级模型，然后他们按顺序指导更强大的模型。实验结果表明，与原始的W2SG相比，我们的方法显着提高了W2SG的性能，从而实现了三个关键任务的重大改进：基于知识的推理（+4.0％），数学推理（+22.62％）和逻辑推理（+12.11％）。这突出了我们新的W2SG范式的有效性，该范式可以使知识转移和监督结果更加强大。

Title: Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction

Authors: Saurabh Srivastava, Ziyu Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07357
Pdf URL: https://arxiv.org/pdf/2504.07357
Copy Paste: [[2504.07357]] Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction(https://arxiv.org/abs/2504.07357)
Keywords: language model, gpt, llm, prompt
Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.
摘要：大型推理模型（LRMS），例如DeepSeek-R1和OpenAI O1，在各种推理任务中表现出了非凡的功能。他们在中间思想上产生和理性的强大能力也导致了论点，即他们可能不再需要广泛的及时工程或优化来解释人类的指示并产生准确的输出。在这项工作中，我们旨在使用案例研究的事件提取的结构化任务来系统地研究这个空旷的问题。我们尝试了两个LRM（DeepSeek-R1和O1）和两个通用大语模型（LLMS）（GPT-4O和GPT-4.5），当时它们被用作任务模型或提示优化器。我们的结果表明，在像事件提取一样复杂的任务上，LRMS作为任务模型仍然受益于及时的优化，并且使用LRMS作为及时的优化器会产生更有效的提示。最后，我们对LRMS犯下的常见错误进行了错误分析，并突出了LRM在完善任务说明和事件指南中的稳定性和一致性。

Title: Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs

Authors: Taibiao Zhao, Xiaobing Chen, Mingxuan Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.07360
Pdf URL: https://arxiv.org/pdf/2504.07360
Copy Paste: [[2504.07360]] Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs(https://arxiv.org/abs/2504.07360)
Keywords: language model, llm
Abstract: The adaptation of large language models (LLMs) to time series forecasting poses unique challenges, as time series data is continuous in nature, while LLMs operate on discrete tokens. Despite the success of LLMs in natural language processing (NLP) and other structured domains, aligning time series data with language-based representations while maintaining both predictive accuracy and interpretability remains a significant hurdle. Existing methods have attempted to reprogram time series data into text-based forms, but these often fall short in delivering meaningful, interpretable results. In this paper, we propose a multi-level text alignment framework for time series forecasting using LLMs that not only improves prediction accuracy but also enhances the interpretability of time series representations. Our method decomposes time series into trend, seasonal, and residual components, which are then reprogrammed into component-specific text representations. We introduce a multi-level alignment mechanism, where component-specific embeddings are aligned with pre-trained word tokens, enabling more interpretable forecasts. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art models in accuracy while providing good interpretability.
摘要：大型语言模型（LLM）对时间序列的改编预测，随着时间序列数据本质上是连续的，而LLMS则以离散令牌运行。尽管LLM在自然语言处理（NLP）和其他结构化领域中取得了成功，但将时间序列数据与基于语言的表示的一致，同时保持预测精度和可解释性仍然是一个重大障碍。现有的方法已尝试将时间序列数据重新编程为基于文本的表单，但是这些方法通常在提供有意义的，可解释的结果方面不足。在本文中，我们建议使用LLMS进行时间序列预测的多级文本对齐框架，该框架不仅提高了预测准确性，还提高了时间序列表示的可解释性。我们的方法将时间序列分解为趋势，季节性和残留成分，然后将其重编程为特定于组件的文本表示。我们引入了一种多级比对机制，其中特定于组件的嵌入与预训练的单词令牌对齐，从而实现了更可解释的预测。多个数据集上的实验表明，我们的方法在提供良好的可解释性的同时，优于最先进的模型。

Title: TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models

Authors: Sher Badshah, Ali Emami, Hassan Sajjad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07385
Pdf URL: https://arxiv.org/pdf/2504.07385
Copy Paste: [[2504.07385]] TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models(https://arxiv.org/abs/2504.07385)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world, autonomous applications, relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare to fixed references or depend solely on LLM-as-a-judge knowledge, TALE employs an agent with tool-access capabilities that actively retrieves and synthesizes external evidence. It iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By shifting away from static references, TALE aligns with free-form question-answering tasks common in real-world scenarios. Experimental results on multiple free-form QA benchmarks show that TALE not only outperforms standard reference-based metrics for measuring response accuracy but also achieves substantial to near-perfect agreement with human evaluations. TALE enhances the reliability of LLM evaluations in real-world, dynamic scenarios without relying on static references.
摘要：随着大型语言模型（LLM）越来越多地集成到现实世界中，自主应用程序依赖于静态的，预先注销的参考文献对评估构成了重大挑战。我们提出了工具增强的LLM评估（Tale），该框架是评估LLM输出而无需预先确定的地面真相答案的框架。与常规指标相比，与固定的参考相比或仅取决于llm-as-a-a-a-a-Gudge知识，Tale采用了具有工具访问能力的代理，这些能力可以积极检索和综合外部证据。它迭代地生成Web查询，收集信息，总结发现并通过反思来完善后续搜索。通过摆脱静态参考，故事与现实世界中常见的自由形式的提问任务保持一致。多种自由质量质量检查基准的实验结果表明，故事不仅胜过基于标准参考的标准指标来衡量响应准确性，而且还可以实现与人类评估的接近完美一致性。故事提高了现实世界中LLM评估的可靠性，而动态场景而不依赖静态参考。

Title: Talking Point based Ideological Discourse Analysis in News Events

Authors: Nishanth Nakshatri, Nikhil Mehta, Siyi Liu, Sihao Chen, Daniel J. Hopkins, Dan Roth, Dan Goldwasser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07400
Pdf URL: https://arxiv.org/pdf/2504.07400
Copy Paste: [[2504.07400]] Talking Point based Ideological Discourse Analysis in News Events(https://arxiv.org/abs/2504.07400)
Keywords: llm
Abstract: Analyzing ideological discourse even in the age of LLMs remains a challenge, as these models often struggle to capture the key elements that shape real-world narratives. Specifically, LLMs fail to focus on characteristic elements driving dominant discourses and lack the ability to integrate contextual information required for understanding abstract ideological views. To address these limitations, we propose a framework motivated by the theory of ideological discourse analysis to analyze news articles related to real-world events. Our framework represents the news articles using a relational structure - talking points, which captures the interaction between entities, their roles, and media frames along with a topic of discussion. It then constructs a vocabulary of repeating themes - prominent talking points, that are used to generate ideology-specific viewpoints (or partisan perspectives). We evaluate our framework's ability to generate these perspectives through automated tasks - ideology and partisan classification tasks, supplemented by human validation. Additionally, we demonstrate straightforward applicability of our framework in creating event snapshots, a visual way of interpreting event discourse. We release resulting dataset and model to the community to support further research.
摘要：即使在LLM时代，分析意识形态话语也是一个挑战，因为这些模型通常难以捕获塑造现实世界叙事的关键要素。具体而言，LLM未能专注于推动主导话语的特征要素，并且缺乏整合理解抽象意识形态观点所需的上下文信息的能力。为了解决这些局限性，我们提出了一个由意识形态话语分析理论促进的框架，以分析与现实事件有关的新闻文章。我们的框架代表了使用关系结构的新闻文章 - 谈话要点，它捕获了实体，其角色和媒体框架之间的相互作用以及讨论主题。然后，它构建了重复主题的词汇 - 突出的话题，这些话题用于产生特定于意识形态的观点（或党派观点）。我们评估了框架通过自动任务 - 意识形态和党派分类任务来生成这些观点的能力，并补充了人类验证。此外，我们在创建事件快照方面展示了框架的直接适用性，这是一种解释事件话语的视觉方式。我们将结果数据集和模型释放到社区，以支持进一步的研究。

Title: AI Coding with Few-Shot Prompting for Thematic Analysis

Authors: Samuel Flanders, Melati Nungsari, Mark Cheong Wing Loong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07408
Pdf URL: https://arxiv.org/pdf/2504.07408
Copy Paste: [[2504.07408]] AI Coding with Few-Shot Prompting for Thematic Analysis(https://arxiv.org/abs/2504.07408)
Keywords: language model, gpt, llm, prompt
Abstract: This paper explores the use of large language models (LLMs), here represented by GPT 3.5-Turbo to perform coding for a thematic analysis. Coding is highly labor intensive, making it infeasible for most researchers to conduct exhaustive thematic analyses of large corpora. We utilize few-shot prompting with higher quality codes generated on semantically similar passages to enhance the quality of the codes while utilizing a cheap, more easily scalable model.
摘要：本文探讨了大型语言模型（LLMS）的使用，此处由GPT 3.5-Turbo代表进行主题分析的编码。编码高度密集，使大多数研究人员对大型语料库进行详尽的主题分析是不可行的。我们利用很少的弹药提示，通过在语义相似的段落上生成的高质量代码，以增强代码的质量，同时使用便宜，更容易扩展的模型。

Title: AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery

Authors: Amirhossein Abaskohi, Amrutha Varshini Ramesh, Shailesh Nanisetty, Chirag Goel, David Vazquez, Christopher Pal, Spandana Gella, Giuseppe Carenini, Issam H. Laradji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07421
Pdf URL: https://arxiv.org/pdf/2504.07421
Copy Paste: [[2504.07421]] AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery(https://arxiv.org/abs/2504.07421)
Keywords: llm, retrieval-augmented generation, agent
Abstract: We introduce AgentAda, the first LLM-powered analytics agent that can learn and use new analytics skills to extract more specialized insights. Unlike existing methods that require users to manually decide which data analytics method to apply, AgentAda automatically identifies the skill needed from a library of analytical skills to perform the analysis. This also allows AgentAda to use skills that existing LLMs cannot perform out of the box. The library covers a range of methods, including clustering, predictive modeling, and NLP techniques like BERT, which allow AgentAda to handle complex analytics tasks based on what the user needs. AgentAda's dataset-to-insight extraction strategy consists of three key steps: (I) a question generator to generate queries relevant to the user's goal and persona, (II) a hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose the best data analytics skill from the skill library, and (III) a code generator that produces executable code based on the retrieved skill's documentation to extract key patterns. We also introduce KaggleBench, a benchmark of curated notebooks across diverse domains, to evaluate AgentAda's performance. We conducted a human evaluation demonstrating that AgentAda provides more insightful analytics than existing tools, with 48.78% of evaluators preferring its analyses, compared to 27.67% for the unskilled agent. We also propose a novel LLM-as-a-judge approach that we show is aligned with human evaluation as a way to automate insight quality evaluation at larger scale.
摘要：我们介绍了第一个可以学习并使用新的分析技能来提取更多专业见解的新分析技巧的AgentaDADA。与需要用户手动确定要应用的数据分析方法的现有方法不同，AgentADA自动确定了分析技能库所需的技能以执行分析。这还允许AgentADA使用现有LLM无法开箱即用的技能。该库涵盖了一系列方法，包括聚类，预测建模和NLP技术（例如Bert），这些方法允许AgentADA根据用户的需求处理复杂的分析任务。 Agenta的数据集对潜在提取策略包括三个关键步骤：（i）生成与用户目标和角色相关的查询的问题，（ii）混合检索的生成生成（RAG）基于基于的技能匹配器，以选择从技能库中的最佳数据分析技能，以及（iii II），该模式基于代码的代码，以删除代码的执行方法。我们还介绍了Kagglebench，这是跨不同领域的策划笔记本的基准，以评估Agenta的性能。我们进行了人类评估，表明AgentA提供了比现有工具更有见地的分析，而48.78％的评估者更喜欢其分析，而非熟练代理人则为27.67％。我们还提出了一种新型的LLM-AS-A-A-A-A-Gudge方法，我们表明，该方法与人类评估保持一致，以此作为一种自动化洞察力质量评估的方式。

Title: From Token to Line: Enhancing Code Generation with a Long-Term Perspective

Authors: Tingwei Lu, Yangning Li, Liyuan Wang, Binghuai Lin, Jiwei Tang, Wanshi Xu, Hai-Tao Zheng, Yinghui Li, Bingxu An, Zhao Wei, Yong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07433
Pdf URL: https://arxiv.org/pdf/2504.07433
Copy Paste: [[2504.07433]] From Token to Line: Enhancing Code Generation with a Long-Term Perspective(https://arxiv.org/abs/2504.07433)
Keywords: language model, llm
Abstract: The emergence of large language models (LLMs) has significantly promoted the development of code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limited focus on choosing the appropriate processing length for generations. By analyzing the attention between tokens during the generation process of LLMs, it can be observed that the high spikes of the attention scores typically appear at the end of lines. This insight suggests that it is reasonable to treat each line of code as a fundamental processing unit and generate them sequentially. Inspired by this, we propose the \textbf{LSR-MCTS} algorithm, which leverages MCTS to determine the code line-by-line and select the optimal path. Further, we integrate a self-refine mechanism at each node to enhance diversity and generate higher-quality programs through error correction. Extensive experiments and comprehensive analyses on three public coding benchmarks demonstrate that our method outperforms the state-of-the-art performance approaches.
摘要：大语言模型（LLM）的出现大大促进了代码生成任务的发展，从而引发了相关文献的激增。当前的研究受到冗余生成结果的阻碍，并且在短期内过度拟合本地模式的趋势。尽管现有的研究试图通过采用多言论预测策略来减轻问题，但仍在关注几代人选择适当的处理长度。通过分析LLMS生成过程中令牌之间的注意力，可以观察到，注意力评分的高尖峰通常出现在线的末端。这种见解表明，将每条代码视为基本处理单元并依次生成它们是合理的。受此启发，我们提出了\ textbf {lsr-mcts}算法，该算法利用MCT确定逐条编码并选择最佳路径。此外，我们在每个节点上整合了一种自我refine机制，以增强多样性并通过误差校正产生更高质量的程序。对三个公共编码基准测试的广泛实验和全面分析表明，我们的方法的表现优于最先进的绩效方法。

Title: Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law

Authors: Yixin Cao, Jiahao Ying, Yaoning Wang, Xipeng Qiu, Xuanjing Huang, Yugang Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07440
Pdf URL: https://arxiv.org/pdf/2504.07440
Copy Paste: [[2504.07440]] Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law(https://arxiv.org/abs/2504.07440)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. In this paper, we analyze the core limitations of traditional evaluation pipelines and propose a novel metric, the Model Utilization Index (MUI), which introduces mechanism interpretability techniques to complement traditional performance metrics. MUI quantifies the extent to which a model leverages its capabilities to complete tasks. The core idea is that to assess an LLM's overall ability, we must evaluate not only its task performance but also the effort expended to achieve the outcome. Our extensive experiments reveal an inverse relationship between MUI and performance, from which we deduce a common trend observed in popular LLMs, which we term the Utility Law. Based on this, we derive four corollaries that address key challenges, including training judgement, the issue of data contamination, fairness in model comparison, and data diversity. We hope that our survey, novel metric, and utility law will foster mutual advancement in both evaluation and mechanism interpretability. Our code can be found at this https URL.
摘要：大型语言模型（LLMS）在学术界，行业和日常应用中都必须是必不可少的，但是当前的评估方法努力保持其快速发展。在本文中，我们分析了传统评估管道的核心局限性，并提出了一种新型指标，即模型利用率指数（MUI），该指数引入了机制可解释性技术以补充传统性能指标。 MUI量化了模型利用其完成任务功能的程度。核心思想是，要评估LLM的整体能力，我们不仅必须评估其任务绩效，而且还必须评估达到结果的努力。我们广泛的实验揭示了MUI与性能之间的反比关系，我们从中推断出在流行的LLM中观察到的共同趋势，我们将其称为“公用事业法”。基于此，我们得出了四个应对关键挑战的推论，包括培训判断，数据污染问题，模型比较中的公平性和数据多样性。我们希望我们的调查，新颖的指标和公用事业法可以在评估和机制解释性方面促进相互进步。我们的代码可以在此HTTPS URL上找到。

Title: Beyond LLMs: A Linguistic Approach to Causal Graph Generation from Narrative Texts

Authors: Zehan Li, Ruhua Pan, Xinyu Pi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07459
Pdf URL: https://arxiv.org/pdf/2504.07459
Copy Paste: [[2504.07459]] Beyond LLMs: A Linguistic Approach to Causal Graph Generation from Narrative Texts(https://arxiv.org/abs/2504.07459)
Keywords: language model, gpt, llm, prompt, agent
Abstract: We propose a novel framework for generating causal graphs from narrative texts, bridging high-level causality and detailed event-specific relationships. Our method first extracts concise, agent-centered vertices using large language model (LLM)-based summarization. We introduce an "Expert Index," comprising seven linguistically informed features, integrated into a Situation-Task-Action-Consequence (STAC) classification model. This hybrid system, combining RoBERTa embeddings with the Expert Index, achieves superior precision in causal link identification compared to pure LLM-based approaches. Finally, a structured five-iteration prompting process refines and constructs connected causal graphs. Experiments on 100 narrative chapters and short stories demonstrate that our approach consistently outperforms GPT-4o and Claude 3.5 in causal graph quality, while maintaining readability. The open-source tool provides an interpretable, efficient solution for capturing nuanced causal chains in narratives.
摘要：我们提出了一个新颖的框架，用于从叙事文本中生成因果图，弥合高级因果关系和详细的事件特定关系。我们的方法首先使用基于大语言模型（LLM）的摘要提取简洁，以代理为中心的顶点。我们介绍了一个“专家索引”，其中包括七个语言知情的功能，并集成了一个情况任务 - 操作结果（STAC）分类模型。与基于LLM的方法相比，这种混合系统将Roberta嵌入与专家指数相结合，在因果链路识别中获得了优异的精度。最后，结构化的五介质提示过程完善并构造了连接的因果图。对100章和短篇小说的实验表明，我们的方法在因果图质量方面始终优于GPT-4O和Claude 3.5，同时保持可读性。开源工具为捕获叙事中细微的因果链提供了可解释，有效的解决方案。

Title: Defense against Prompt Injection Attacks via Mixture of Encodings

Authors: Ruiyi Zhang, David Sullivan, Kyle Jackson, Pengtao Xie, Mei Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07467
Pdf URL: https://arxiv.org/pdf/2504.07467
Copy Paste: [[2504.07467]] Defense against Prompt Injection Attacks via Mixture of Encodings(https://arxiv.org/abs/2504.07467)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have emerged as a dominant approach for a wide range of NLP tasks, with their access to external information further enhancing their capabilities. However, this introduces new vulnerabilities, known as prompt injection attacks, where external content embeds malicious instructions that manipulate the LLM's output. Recently, the Base64 defense has been recognized as one of the most effective methods for reducing success rate of prompt injection attacks. Despite its efficacy, this method can degrade LLM performance on certain NLP tasks. To address this challenge, we propose a novel defense mechanism: mixture of encodings, which utilizes multiple character encodings, including Base64. Extensive experimental results show that our method achieves one of the lowest attack success rates under prompt injection attacks, while maintaining high performance across all NLP tasks, outperforming existing character encoding-based defense methods. This underscores the effectiveness of our mixture of encodings strategy for both safety and task performance metrics.
摘要：大型语言模型（LLM）已成为多种NLP任务的主要方法，其访问外部信息进一步增强了其能力。但是，这引入了新的漏洞，称为及时注射攻击，外部内容嵌入了操纵LLM输出的恶意说明。最近，基本64防御被认为是降低快速进攻成功率的最有效方法之一。尽管具有功效，但这种方法可以在某些NLP任务上降低LLM的性能。为了应对这一挑战，我们提出了一种新颖的防御机制：编码的混合物，该机制利用了多个字符编码，包括base64。广泛的实验结果表明，我们的方法在迅速注射攻击下达到了最低的攻击成功率之一，同时保持所有NLP任务的高性能，表现优于现有的基于字符的防御方法。这强调了我们对安全性和任务绩效指标的编码策略混合的有效性。

Title: Transformer-Based Temporal Information Extraction and Application: A Review

Authors: Xin Su, Phillip Howard, Steven Bethard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07470
Pdf URL: https://arxiv.org/pdf/2504.07470
Copy Paste: [[2504.07470]] Transformer-Based Temporal Information Extraction and Application: A Review(https://arxiv.org/abs/2504.07470)
Keywords: language model
Abstract: Temporal information extraction (IE) aims to extract structured temporal information from unstructured text, thereby uncovering the implicit timelines within. This technique is applied across domains such as healthcare, newswire, and intelligence analysis, aiding models in these areas to perform temporal reasoning and enabling human users to grasp the temporal structure of text. Transformer-based pre-trained language models have produced revolutionary advancements in natural language processing, demonstrating exceptional performance across a multitude of tasks. Despite the achievements garnered by Transformer-based approaches in temporal IE, there is a lack of comprehensive reviews on these endeavors. In this paper, we aim to bridge this gap by systematically summarizing and analyzing the body of work on temporal IE using Transformers while highlighting potential future research directions.
摘要：时间信息提取（IE）旨在从非结构化文本中提取结构化的时间信息，从而发现内部的隐式时间表。该技术应用于医疗保健，新闻和情报分析等领域，这些领域有助于这些领域的模型执行时间推理，并使人用户能够掌握文本的时间结构。基于变压器的预训练的语言模型在自然语言处理方面产生了革命性的进步，表明了多种任务的出色表现。尽管暂时性IE中基于变压器的方法所取得的成就，但对这些努力缺乏全面的评论。在本文中，我们旨在通过系统地总结和分析使用变压器的时间上的工作体系来弥合这一差距，同时突出潜在的未来研究方向。

Title: Supervised Optimism Correction: Be Confident When LLMs Are Sure

Authors: Junjie Zhang, Rushuai Yang, Shunyu Liu, Ting-En Lin, Fei Huang, Yi Chen, Yongbin Li, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07527
Pdf URL: https://arxiv.org/pdf/2504.07527
Copy Paste: [[2504.07527]] Supervised Optimism Correction: Be Confident When LLMs Are Sure(https://arxiv.org/abs/2504.07527)
Keywords: language model, llm
Abstract: In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction(SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
摘要：在这项工作中，我们建立了在令牌级别马尔可夫决策过程中受监督的微调和离线增强学习之间的新理论联系，表明大型语言模型确实学习了推理的隐含$ q $函数。通过这种理论镜头，我们证明了广泛使用的梁搜索方法遭受了不可接受的过度优势，在这种情况下，由于副本步骤的$ q $ - 价值估计，推理误差不可避免地会放大。为了解决这一限制，我们提出了监督的乐观校正（SOC），该校正引入了在受监督的微调过程中对令牌级别$ q $值估算的简单而有效的辅助损失。具体而言，辅助损失采用隐式价值正则化来增强对专家示范响应的模型信心，从而抑制了过度优化的对不足监督的响应。包括GSM8K，Math和Gaokao在内的数学推理基准测试基准的广泛实验，展示了拟议的SOC在一系列开源模型中使用横梁搜索的优越性。

Title: AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation

Authors: Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.07532
Pdf URL: https://arxiv.org/pdf/2504.07532
Copy Paste: [[2504.07532]] AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation(https://arxiv.org/abs/2504.07532)
Keywords: language model, llm
Abstract: AI-generated text is proliferating across domains, from creative writing and journalism to marketing content and scientific articles. Models can follow user-provided instructions to generate coherent and grammatically correct outputs but in this work, we study a more fundamental question: how do we evaluate and improve the writing quality of AI-generated text? Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark. To further show WQRM's practical benefits during inference, we leverage additional test-time compute to generate and rank multiple candidate revisions, allowing us to select higher-quality outputs from an initial draft. Human evaluation with 9 experienced writers confirm that WQRM-based selection produces writing samples preferred by experts 66% overall, and 72.2% when the reward gap is larger than 1 point. We release our datasets and models to encourage community engagement with writing quality assessment and development of AI writing systems better aligned with human preferences.
摘要：从创意写作和新闻业到营销内容和科学文章，AI生成的文本正在跨领域增殖。模型可以遵循用户提供的说明来产生连贯和语法上正确的输出，但是在这项工作中，我们研究了一个更基本的问题：我们如何评估和提高AI生成的文本的写作质量？写作质量评估受到社区的关注较少，部分原因是它从根本上是主观的，需要专业知识。我们首先通过将五个写作质量数据集合并为4,729个写作质量判断，从而介绍写作质量基准（WQ）。我们的实验表明，竞争性基线，包括在推理任务上表现出色的最先进的LLM，在WQ上几乎不超过随机基线。然后，我们培训各种尺寸的专业写作质量奖励模型（WQRM），用于写作质量评估，这些质量评估表明在四个分发测试集上有强烈的概括，而WQ基准的精度为74％。为了进一步显示WQRM在推断期间的实际好处，我们利用额外的测试时间计算来生成和对多个候选修订，使我们能够从初始草案中选择更高质量的输出。与9位经验丰富的作家评估的人类评估证实，基于WQRM的选择会产生由专家偏爱的写作样本，总体66％，当奖励差距大于1分时，会产生72.2％。我们发布了我们的数据集和模型，以鼓励社区参与写作质量评估和AI写作系统的质量评估和开发，以更好地与人类的偏好保持一致。

Title: Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Authors: Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André F.T. Martins, Graham Neubig
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.07583
Pdf URL: https://arxiv.org/pdf/2504.07583
Copy Paste: [[2504.07583]] Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering(https://arxiv.org/abs/2504.07583)
Keywords: llm
Abstract: Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more ``pragmatic'' approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at this https URL
摘要：尽管机器翻译评估取得了稳步的进展，但现有的自动指标努力捕获句子范围内的含义的保留程度。我们认为，对模仿人类判断的单一内在质量评分的依赖可能不足以评估长，复杂段落的翻译，以及更需要评估需要在上下文中翻译来评估关键信息的准确信息的``务实''方法。我们介绍了TREQA（通过问答的翻译评估），该框架通过评估如何准确地评估候选人翻译的方式来外部评估翻译质量，以回答原始源或参考文本中关键信息的读取理解问题。在需要远程理解的具有挑战性的领域中，我们表明TREQA具有竞争力，并且在某些情况下，在排名替代段落级翻译方面，远远超过了最先进的神经和LLM的指标，尽管从未明确地优化与人类判断相关。此外，生成的问题和答案提供了解释性：经验分析表明，它们有效地针对评估数据集专家确定的翻译错误。我们的代码可在此HTTPS URL上找到

Title: ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in Large Language Models

Authors: Joel Barmettler, Abraham Bernstein, Luca Rossetto
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.07624
Pdf URL: https://arxiv.org/pdf/2504.07624
Copy Paste: [[2504.07624]] ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in Large Language Models(https://arxiv.org/abs/2504.07624)
Keywords: language model, gpt, llm, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) has enjoyed increased attention in the recent past and recent advancements in Large Language Models (LLMs) have highlighted the importance of integrating world knowledge into these systems. Current RAG methodologies often modify the internal architecture of pre-trained language models (PLMs) or rely on textifying knowledge graphs (KGs), which is inefficient in terms of token usage. This paper introduces ConceptFormer, a new approach to augment LLMs with structured knowledge from KGs, such as Wikidata, without altering their internal structure or relying on textual input of KGs. ConceptFormer operates in the LLM embedding vector space, creating and injecting \emph{concept vectors} that encapsulate the information of the KG nodes directly. Trained in conjunction with a frozen LLM, ConceptFormer generates a comprehensive lookup table that maps KG nodes to their respective concept vectors. The approach aims to enhance the factual recall capabilities of LLMs by enabling them to process these concept vectors natively, thus enriching them with structured world knowledge in an efficient and scalable manner. Our experiments demonstrate that the addition of concept vectors to GPT-2 0.1B substantially increases its factual recall ability (Hit@10) by up to 272\% when tested on sentences from Wikipedia and up to 348\% on synthetically generated sentences. Even injecting only a single concept vector into the prompt increases factual recall ability (Hit@10) by up to 213\% on Wikipedia sentences, significantly outperforming RAG with graph textification while consuming 130x fewer input tokens.
摘要：在最近的过去和大型语言模型中的最新进步（LLM）的最新进步强调了将世界知识纳入这些系统的重要性。当前的抹布方法通常会修改预训练的语言模型（PLM）的内部体系结构或依靠文本化知识图（kgs），这在令牌用法方面效率低下。本文介绍了ConceptFormer，这是一种增强LLM的新方法，具有来自Wikidata等KGS的结构化知识，而无需更改其内部结构或依赖KGS的文本输入。 ConceptFormer在LLM嵌入矢量空间中运行，创建和注入\ Emph {概念向量}，该\ emph {concept vectors}直接封装了kg节点的信息。与冷冻LLM一起培训，ConceptFormer生成了一个全面的查找表，将KG节点映射到其各自的概念矢量。该方法旨在通过使其能够在本地处理这些概念向量，从而以有效且可扩展的方式来增强它们，从而增强LLM的事实召回能力。我们的实验表明，在对Wikipedia的句子进行测试时，将概念向量添加到GPT-2 0.1B上大大提高了其事实召回能力（HIT@10），最高为272 \％。即使仅将单个概念向量注入提示中，在Wikipedia句子上，将事实召回能力（HIT@10）提高了213 \％，在消耗130倍的输入令牌的同时，大大优于图形文本化的抹布。

Title: On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

Authors: Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07646
Pdf URL: https://arxiv.org/pdf/2504.07646
Copy Paste: [[2504.07646]] On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data(https://arxiv.org/abs/2504.07646)
Keywords: language model, llm, tree-of-thought
Abstract: The applicability of Large Language Models (LLMs) in temporal reasoning tasks over data that is not present during training is still a field that remains to be explored. In this paper we work on this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the \textit{Reasoning and Answering Temporal Ability} dataset (RATA), featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflexion and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.
摘要：大语言模型（LLM）在培训期间不存在的数据中的时间推理任务中的适用性仍然是待探索的领域。在本文中，我们致力于此主题，重点关注结构化和半结构化匿名数据。我们不仅开发了直接的LLM管道，而且还比较了各种方法并进行了深入的分析。我们以自然语言确定并检查了十七个常见的时间推理任务，重点是其算法组成部分。为了评估LLM性能，我们创建了\ textIt {推理和回答时间能力}数据集（RATA），该数据集（RATA）具有半结构化的匿名数据，以确保依赖推理而不是先验知识。我们比较了几种涉及SOTA技术的方法，例如针对这种情况的思想树，自我反弹和代码执行。我们的结果表明，实现可扩展和可靠的解决方案不仅需要独立的LLM，还需要集成方法。

Title: Unveiling the Impact of Multimodal Features on Chinese Spelling Correction: From Analysis to Design

Authors: Xiaowu Zhang, Hongfei Zhao, Jingyi Hou, Zhijie Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07661
Pdf URL: https://arxiv.org/pdf/2504.07661
Copy Paste: [[2504.07661]] Unveiling the Impact of Multimodal Features on Chinese Spelling Correction: From Analysis to Design(https://arxiv.org/abs/2504.07661)
Keywords: language model, llm
Abstract: The Chinese Spelling Correction (CSC) task focuses on detecting and correcting spelling errors in sentences. Current research primarily explores two approaches: traditional multimodal pre-trained models and large language models (LLMs). However, LLMs face limitations in CSC, particularly over-correction, making them suboptimal for this task. While existing studies have investigated the use of phonetic and graphemic information in multimodal CSC models, effectively leveraging these features to enhance correction performance remains a challenge. To address this, we propose the Multimodal Analysis for Character Usage (\textbf{MACU}) experiment, identifying potential improvements for multimodal correctison. Based on empirical findings, we introduce \textbf{NamBert}, a novel multimodal model for Chinese spelling correction. Experiments on benchmark datasets demonstrate NamBert's superiority over SOTA methods. We also conduct a comprehensive comparison between NamBert and LLMs, systematically evaluating their strengths and limitations in CSC. Our code and model are available at this https URL.
摘要：中国拼写校正（CSC）任务着重于检测和纠正句子中的拼写错误。当前的研究主要探讨了两种方法：传统的多模式预训练模型和大型语言模型（LLMS）。但是，LLMS在CSC中面临限制，尤其是过度校正，使其在此任务中均优美。尽管现有研究研究了在多模式CSC模型中使用语音和图形信息的使用，但有效利用这些特征来增强校正性能仍然是一个挑战。为了解决这个问题，我们提出了针对字符使用情况（\ textbf {MACU}）实验的多模式分析，从而确定了多模式校正的潜在改进。根据经验发现，我们介绍了\ textbf {nambert}，这是一种用于中国拼写校正的新型多模式模型。基准数据集上的实验证明了Nambert对SOTA方法的优越性。我们还对Nambert和LLMS进行了全面比较，从系统地评估了它们在CSC中的优势和局限性。我们的代码和模型可在此HTTPS URL上找到。

Title: Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations

Authors: Sheila Castilho, Zoe Fitzsimmons, Claire Holton, Aoife Mc Donagh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07680
Pdf URL: https://arxiv.org/pdf/2504.07680
Copy Paste: [[2504.07680]] Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations(https://arxiv.org/abs/2504.07680)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: This study examines hallucinations in Large Language Model (LLM) translations into Irish, specifically focusing on instances where the models generate novel, non-existent words. We classify these hallucinations within verb and noun categories, identifying six distinct patterns among the latter. Additionally, we analyse whether these hallucinations adhere to Irish morphological rules and what linguistic tendencies they exhibit. Our findings show that while both GPT-4.o and GPT-4.o Mini produce similar types of hallucinations, the Mini model generates them at a significantly higher frequency. Beyond classification, the discussion raises speculative questions about the implications of these hallucinations for the Irish language. Rather than seeking definitive answers, we offer food for thought regarding the increasing use of LLMs and their potential role in shaping Irish vocabulary and linguistic evolution. We aim to prompt discussion on how such technologies might influence language over time, particularly in the context of low-resource, morphologically rich languages.
摘要：这项研究研究了大语模型（LLM）翻译成爱尔兰的幻觉，特别关注模型产生新颖的，不存在的单词的实例。我们将这些幻觉分类为动词和名词类别，从而确定了后者之间的六种不同模式。此外，我们分析了这些幻觉是否遵守爱尔兰形态规则以及它们表现出的语言倾向。我们的发现表明，尽管GPT-4.O和GPT-4.O迷你都产生相似类型的幻觉，但Mini模型以明显更高的频率生成它们。除了分类之外，讨论还提出了有关这些幻觉对爱尔兰语言的影响的投机性问题。我们没有寻求确定的答案，而是为越来越多的LLM使用及其在塑造爱尔兰词汇和语言进化中的潜在作用提供思考的食物。我们的目标是促使讨论此类技术如何随着时间的流逝影响语言，尤其是在低资源，形态丰富的语言的背景下。

Title: Proactive User Information Acquisition via Chats on User-Favored Topics

Authors: Shiki Sato, Jun Baba, Asahi Hentona, Shinji Iwata, Akifumi Yoshimoto, Koichiro Yoshino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07698
Pdf URL: https://arxiv.org/pdf/2504.07698
Copy Paste: [[2504.07698]] Proactive User Information Acquisition via Chats on User-Favored Topics(https://arxiv.org/abs/2504.07698)
Keywords: language model, llm, chat
Abstract: Chat-oriented dialogue systems designed to provide tangible benefits, such as sharing the latest news or preventing frailty in senior citizens, often require Proactive acquisition of specific user Information via chats on user-faVOred Topics (PIVOT). This study proposes the PIVOT task, designed to advance the technical foundation for these systems. In this task, a system needs to acquire the answers of a user to predefined questions without making the user feel abrupt while engaging in a chat on a predefined topic. We found that even recent large language models (LLMs) show a low success rate in the PIVOT task. We constructed a dataset suitable for the analysis to develop more effective systems. Finally, we developed a simple but effective system for this task by incorporating insights obtained through the analysis of this dataset.
摘要：面向聊天的对话系统旨在提供切实的好处，例如分享最新消息或预防老年人的脆弱性，通常需要通过有关用户最喜欢的主题（Pivot）的聊天来主动地掌握特定的用户信息。这项研究提出了枢轴任务，旨在推动这些系统的技术基础。在此任务中，系统需要获取用户的答案以预定义的问题，而不会使用户在对预定义主题进行聊天时突然感到突然。我们发现，即使最近的大型语言模型（LLM）在枢轴任务中的成功率较低。我们构建了一个适合分析的数据集，以开发更有效的系统。最后，我们通过合并通过对该数据集的分析获得的洞察力开发了一个简单但有效的系统。

Title: MRD-RAG: Enhancing Medical Diagnosis with Multi-Round Retrieval-Augmented Generation

Authors: Yixiang Chen, Penglei Sun, Xiang Li, Xiaowen Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07724
Pdf URL: https://arxiv.org/pdf/2504.07724
Copy Paste: [[2504.07724]] MRD-RAG: Enhancing Medical Diagnosis with Multi-Round Retrieval-Augmented Generation(https://arxiv.org/abs/2504.07724)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: In recent years, accurately and quickly deploying medical large language models (LLMs) has become a significant trend. Among these, retrieval-augmented generation (RAG) has garnered significant attention due to its features of rapid deployment and privacy protection. However, existing medical RAG frameworks still have shortcomings. Most existing medical RAG frameworks are designed for single-round question answering tasks and are not suitable for multi-round diagnostic dialogue. On the other hand, existing medical multi-round RAG frameworks do not consider the interconnections between potential diseases to inquire precisely like a doctor. To address these issues, we propose a Multi-Round Diagnostic RAG (MRD-RAG) framework that mimics the doctor's diagnostic process. This RAG framework can analyze diagnosis information of potential diseases and accurately conduct multi-round diagnosis like a doctor. To evaluate the effectiveness of our proposed frameworks, we conduct experiments on two modern medical datasets and two traditional Chinese medicine datasets, with evaluations by GPT and human doctors on different methods. The results indicate that our RAG framework can significantly enhance the diagnostic performance of LLMs, highlighting the potential of our approach in medical diagnosis. The code and data can be found in our project website this https URL.
摘要：近年来，准确，迅速地部署医学大语言模型（LLM）已成为一个重大趋势。其中，由于其快速部署和隐私保护的特征，检索型的一代（RAG）引起了极大的关注。但是，现有的医疗抹布框架仍然存在缺点。大多数现有的医学RAG框架都是为单轮问答任务而设计的，不适合多轮诊断对话。另一方面，现有的医学多发抹布框架并不认为潜在疾病之间的互连可以像医生一样准确地询问。为了解决这些问题，我们提出了一个模拟医生诊断过程的多轮诊断抹布（MRD-rag）框架。该破布框架可以分析潜在疾病的诊断信息，并像医生一样准确地进行多轮诊断。为了评估我们提出的框架的有效性，我们对两个现代医疗数据集和两个中医数据集进行了实验，并由GPT和人类医生对不同方法进行了评估。结果表明，我们的破布框架可以显着提高LLM的诊断性能，从而强调我们在医学诊断中的潜力。代码和数据可以在我们的项目网站上找到此HTTPS URL。

Title: DeepGreen: Effective LLM-Driven Green-washing Monitoring System Designed for Empirical Testing -- Evidence from China

Authors: Congluo Xu, Yu Miao, Yiling Xiao, Chengmengjia Lin
Subjects: cs.CL, econ.GN
Abstract URL: https://arxiv.org/abs/2504.07733
Pdf URL: https://arxiv.org/pdf/2504.07733
Copy Paste: [[2504.07733]] DeepGreen: Effective LLM-Driven Green-washing Monitoring System Designed for Empirical Testing -- Evidence from China(https://arxiv.org/abs/2504.07733)
Keywords: language model, llm
Abstract: This paper proposes DeepGreen, an Large Language Model Driven (LLM-Driven) system for detecting corporate green-washing behaviour. Utilizing dual-layer LLM analysis, DeepGreen preliminarily identifies potential green keywords in financial statements and then assesses their implementation degree via iterative semantic analysis of LLM. A core variable GreenImplement is derived from the ratio from the two layers' output. We extract 204 financial statements of 68 companies from A-share market over three years, comprising 89,893 words, and analyse them through DeepGreen. Our analysis, supported by violin plots and K-means clustering, reveals insights and validates the variable against the Huazheng ESG rating. It offers a novel perspective for regulatory agencies and investors, serving as a proactive monitoring tool that complements traditional this http URL tests show that green implementation can significantly boost the asset return rate of companies, but there is heterogeneity in scale. Small and medium-sized companies have limited contribution to asset return via green implementation, so there is a stronger motivation for green-washing.
摘要：本文提出了DeepGreen，这是一种大型语言模型驱动（LLM驱动）系统，用于检测公司的绿化行为。使用双层LLM分析，DeepGreen先初步识别财务报表中的潜在绿色关键字，然后通过LLM的迭代语义分析评估其实施程度。核心变量的绿色单元源于从两个层的输出的比率得出。我们在三年内提取了来自A股市场的68家公司的204个财务报表，其中包含89,893个单词，并通过Deepgreen分析它们。我们的分析在小提琴图和K-均值聚类的支持下，揭示了有关Huazheng ESG等级的变量的见解和验证。它为监管机构和投资者提供了一种新颖的看法，它是一种主动监测工具，可以补充传统的HTTP URL测试，表明绿色实施可以显着提高公司资产回报率，但规模上存在异质性。中小型公司通过绿色实施对资产返回的贡献有限，因此有更强大的绿色动力。

Title: Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

Authors: A. Loreti, K. Chen, R. George, R. Firth, A. Agnello, S. Tanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07738
Pdf URL: https://arxiv.org/pdf/2504.07738
Copy Paste: [[2504.07738]] Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information(https://arxiv.org/abs/2504.07738)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: In this document, we discuss a multi-step approach to automated construction of a knowledge graph, for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf's law, which characterizes human-generated natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that combines large language models with a multi-prompt approach. This system provides contextually relevant answers to natural-language queries, including complex multi-hop questions that require reasoning across interconnected entities.
摘要：在本文档中，我们讨论了一种自动构造知识图的多步方法，以构建和代表大型文档Corpora的特定领域知识。我们应用我们的方法来构建核融合能的第一个知识图，这是一个高度专业的领域，其特征是范围巨大和异质性。这是测试管道关键特征的理想基准，包括自动命名实体识别和实体分辨率。我们展示了如何使用预训练的大型语言模型来应对这些挑战，并根据ZIPF定律评估了它们的绩效，这是人类生成的自然语言的特征。此外，我们开发了一个知识记录检索的生成系统，该系统将大型语言模型与多项目的方法相结合。该系统为自然语言查询提供了上下文相关的答案，包括需要跨互连实体推理的复杂多跳跃问题。

Title: NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

Authors: Vladislav Mikhailov, Tita Enstad, David Samuel, Hans Christian Farsethås, Andrey Kutuzov, Erik Velldal, Lilja Øvrelid
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07749
Pdf URL: https://arxiv.org/pdf/2504.07749
Copy Paste: [[2504.07749]] NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark(https://arxiv.org/abs/2504.07749)
Keywords: language model, prompt
Abstract: This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets -- of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-written prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.
摘要：本文介绍了Noreval，这是一个新的全面评估套件，用于大规模的挪威生成语言模型（LMS）的标准化基准测试。 Noreval由24个高质量的人类创建的数据集组成 - 其中5个是从头开始创建的。与挪威语的现有基准相反，Noreval涵盖了针对挪威语言理解和产生，建立人类基线的广泛的任务类别，并专注于挪威语的官方书面标准：Bokmål和Nynorsk。我们所有的数据集和100多个人工编写的提示集集成了LM评估线束，以确保灵活和可重现的评估。我们描述了Noreval的设计，并在各种情况下为Norwegian进行了基准测试的19个开源和指导调节的LMS的结果。我们的基准，评估框架和注释材料公开可用。

Title: Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation

Authors: Bo Zhang, Hui Ma, Dailin Li, Jian Ding, Jian Wang, Bo Xu, HongFei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07754
Pdf URL: https://arxiv.org/pdf/2504.07754
Copy Paste: [[2504.07754]] Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation(https://arxiv.org/abs/2504.07754)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate remarkable text comprehension and generation capabilities but often lack the ability to utilize up-to-date or domain-specific knowledge not included in their training data. To address this gap, we introduce KEDiT, an efficient method for fine-tuning LLMs for knowledge-grounded dialogue generation. KEDiT operates in two main phases: first, it employs an information bottleneck to compress retrieved knowledge into learnable parameters, retaining essential information while minimizing computational overhead. Second, a lightweight knowledge-aware adapter integrates these compressed knowledge vectors into the LLM during fine-tuning, updating less than 2\% of the model parameters. The experimental results on the Wizard of Wikipedia and a newly constructed PubMed-Dialog dataset demonstrate that KEDiT excels in generating contextually relevant and informative responses, outperforming competitive baselines in automatic, LLM-based, and human evaluations. This approach effectively combines the strengths of pretrained LLMs with the adaptability needed for incorporating dynamic knowledge, presenting a scalable solution for fields such as medicine.
摘要：大型语言模型（LLMS）表现出了非凡的文本理解和发电能力，但通常缺乏使用其培训数据中未包含的最新知识或特定于领域的知识的能力。为了解决这一差距，我们介绍了Kedit，这是一种为知识接地对话生成的微调LLM的有效方法。 Kedit分为两个主要阶段运行：首先，它采用信息瓶颈来将检索到的知识压缩为可学习的参数，保留基本信息，同时最大程度地减少计算开销。其次，轻巧的知识感知适配器在微调过程中将这些压缩知识向量整合到LLM中，更新了少于模型参数的2 \％。 Wikipedia向导和新建造的PubMed-Dialog数据集的实验结果表明，Kedit在产生上下文相关和信息性的响应方面表现出色，在自动，基于LLM和人类评估中表现优于竞争性基线。这种方法有效地结合了预验证的LLM的优势与合并动态知识所需的适应性，为医学等田间提供了可扩展的解决方案。

Title: Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation

Authors: Alireza Salemi, Chris Samarinas, Hamed Zamani
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.07794
Pdf URL: https://arxiv.org/pdf/2504.07794
Copy Paste: [[2504.07794]] Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation(https://arxiv.org/abs/2504.07794)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework based on a two phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding additional descriptions. This phase is followed by a local exploitation phase that generates a response proposal for the input query conditioned on each plan and iteratively refines the proposal for improving the proposal quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments based on the ICAT evaluation methodology--a recent approach for answer factuality and comprehensiveness evaluation. Experiments on the two diverse information seeking benchmarks adopted from non-factoid question answering and TREC search result diversification tasks demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a smaller scale user study confirms the substantial efficacy of the P&R framework.
摘要：本文研究了（检索型）大语言模型（LLMS）在产生多样化和全面的响应中的局限性，并基于两个相系统的设计介绍了计划和refine（P＆R）框架。在全球勘探阶段，P＆R为给定输入生成了一套多种计划，每个计划都包含各种查询方面的列表，并具有相应的其他描述。此阶段之后是本地剥削阶段，该阶段为每个计划条件的输入查询生成了响应建议，并迭代地完善了提高提案质量的建议。最后，采用奖励模型来选择最高的事实和覆盖范围的提案。我们基于ICAT评估方法进行实验 - 一种最新的回答事实和全面评估的方法。从非事实问题答案和TREC搜索结果多元化任务中采用的两个不同信息的实验，这表明，P＆R的表现明显优于基线，在古董数据集上取得了13.1％的改善，而TREC数据集则提高了15.41％的改善。此外，一项较小的用户研究证实了P＆R框架的实质性功效。

Title: A System for Comprehensive Assessment of RAG Frameworks

Authors: Mattia Rengo, Senad Beadini, Domenico Alfano, Roberto Abbruzzese
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.07803
Pdf URL: https://arxiv.org/pdf/2504.07803
Copy Paste: [[2504.07803]] A System for Comprehensive Assessment of RAG Frameworks(https://arxiv.org/abs/2504.07803)
Keywords: language model, llm, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) has emerged as a standard paradigm for enhancing the factual accuracy and contextual relevance of Large Language Models (LLMs) by integrating retrieval mechanisms. However, existing evaluation frameworks fail to provide a holistic black-box approach to assessing RAG systems, especially in real-world deployment scenarios. To address this gap, we introduce SCARF (System for Comprehensive Assessment of RAG Frameworks), a modular and flexible evaluation framework designed to benchmark deployed RAG applications systematically. SCARF provides an end-to-end, black-box evaluation methodology, enabling a limited-effort comparison across diverse RAG frameworks. Our framework supports multiple deployment configurations and facilitates automated testing across vector databases and LLM serving strategies, producing a detailed performance report. Moreover, SCARF integrates practical considerations such as response coherence, providing a scalable and adaptable solution for researchers and industry professionals evaluating RAG applications. Using the REST APIs interface, we demonstrate how SCARF can be applied to real-world scenarios, showcasing its flexibility in assessing different RAG frameworks and configurations. SCARF is available at GitHub repository.
摘要：检索增强发电（RAG）已成为通过集成检索机制来增强大语言模型（LLM）的事实准确性和上下文相关性的标准范式。但是，现有的评估框架未能提供整体黑盒方法来评估抹布系统，尤其是在现实世界部署方案中。为了解决这一差距，我们介绍了围巾（用于RAG框架的全面评估系统），这是一个模块化且灵活的评估框架，旨在系统地进行基准部署的抹布应用程序。围巾提供了一种端到端的黑盒评估方法，从而实现了各种抹布框架的限量比较。我们的框架支持多种部署配置，并促进了跨矢量数据库和LLM服务策略的自动测试，从而生成详细的绩效报告。此外，围巾还整合了诸如响应连贯性之类的实际考虑因素，为评估破布应用的研究人员和行业专业人员提供了可扩展且可适应性的解决方案。使用REST API接口，我们演示了如何将围巾应用于实际场景，展示其在评估不同的抹布框架和配置方面的灵活性。围巾可在GitHub存储库中找到。

Title: Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models

Authors: Hongcheng Guo, Juntao Yao, Boyang Wang, Junjia Du, Shaosheng Cao, Donglin Di, Shun Zhang, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07807
Pdf URL: https://arxiv.org/pdf/2504.07807
Copy Paste: [[2504.07807]] Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models(https://arxiv.org/abs/2504.07807)
Keywords: language model, gpt, llm
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: 1).intra-layer expert homogeneity where experts within the same MoE layer exhibit functional redundancy, and 2). inter-layer similarity patterns where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Expert Pruning (C-Prune), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-Prune operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, followed by global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-Prune effectively reduces model size while outperforming existing MoE pruning methods.
摘要：专家（MOE）架构的混合物已经成为一种有希望的范式，用于扩展大型语言模型（LLMS），以稀疏的特定于任务专家的激活。尽管在推断过程中它们的计算效率，但MOE模型的庞大总体参数足迹（例如GPT-4）引入了实践部署的关键挑战。当前的修剪方法通常无法解决MOE系统的两个固有特征：1）.intra-Layer专家同质性，其中同一MOE层中的专家表现出功能冗余，而2）。较深层倾向于逐渐包含更多均匀专家的层间相似性模式。为了解决这些问题，我们提出了集群驱动的专家修剪（C-Prune），这是一个新颖的两阶段框架，用于自适应特定于任务的MOE LLMS。 c-prune通过层次专家聚类运行，这些集群使用参数相似性指标在每个MOE层中在功能上相似的专家，然后进行全局群集修剪，从而消除了通过统一的重要性评分机制在所有层上消除了跨层同质性的统一重要性评分机制。我们通过对多个MOE模型和基准测试的广泛实验来验证C-Prune。结果表明，C-Prune可以有效地减少模型大小，同时胜过现有的MOE修剪方法。

Title: What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Authors: Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07825
Pdf URL: https://arxiv.org/pdf/2504.07825
Copy Paste: [[2504.07825]] What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks(https://arxiv.org/abs/2504.07825)
Keywords: language model, prompt
Abstract: Common-sense reasoning is a key language model capability because it encapsulates not just specific factual knowledge but rather general language and world understanding. Measuring common-sense reasoning, therefore, is crucial for language models of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; however, in this paper, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts or equally correct options. Furthermore, we show that if models are evaluated only on answer texts, or with "Lorem ipsum dolor..." instead of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, these validity issues can have severe consequences. In particular, knowing that taking benchmark scores at face value is ubiquitous, inadequate evaluation leads to ill-informed decisions about models. In this paper, we thoroughly investigate critical validity issues posed by HellaSwag and illustrate them with various evaluations using generative language models of different sizes. We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state. Based on the results of our study, we propose requirements that should be met by future common-sense reasoning benchmarks. In addition, we release GoldenSwag, a corrected subset of HellaSwag, which, to our belief, facilitates acceptable common-sense reasoning evaluation.
摘要：常识性推理是一种关键的语言模型能力，因为它不仅封装了特定的事实知识，还封装了一般语言和世界理解。因此，测量常识性推理对于不同大小和应用的语言模型至关重要。 Hellaswag是评估此类功能的最广泛使用的基准之一。但是，在本文中，我们表明它存在严重的构造有效性问题。这些问题范围从基本的非语法和众多错别字到误导提示或同样正确的选项。此外，我们表明，如果仅在答案文本或“ lorem ipsum dolor ...”中评估模型，而不是问题，则超过65％的模型预测保持不变，这不能仅仅归因于污染。由于基准分数是研究和商业应用中模型选择的重要组成部分，因此这些有效性问题可能会带来严重的后果。特别是，知道以面值的基准分数无处不在，评估不足会导致对模型的明智性决策。在本文中，我们彻底研究了Hellaswag提出的关键有效性问题，并使用不同尺寸的生成语言模型通过各种评估来说明它们。我们认为，该基准不能准确测量常识性推理，因此不应在其当前状态下进行评估。根据我们的研究结果，我们提出的要求应通过未来常识性推理基准来满足。此外，我们释放了GoldensWag，这是Hellaswag的校正子集，据我们的信念，它促进了可接受的常识性推理评估。

Title: MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations

Authors: Genglin Liu, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, Saadia Gabriel
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2504.07830
Pdf URL: https://arxiv.org/pdf/2504.07830
Copy Paste: [[2504.07830]] MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations(https://arxiv.org/abs/2504.07830)
Keywords: llm, agent
Abstract: We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents' articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open-source our simulation software to encourage further research within AI and social sciences.
摘要：我们提出了一个新颖的开源社交网络模拟框架，即马赛克，生成语言代理可以预测用户行为，例如喜欢，共享和标记内容。该模拟将LLM代理与有导的社交图相结合，以分析新兴的欺骗行为，并更好地了解用户如何确定在线社交内容的真实性。通过从各种细粒度的角色构建用户表示，我们的系统启用了多代理模拟，以大规模建模内容传播和参与动态。在此框架内，我们通过模拟的错误信息传播评估了三种不同的内容审核策略，我们发现它们不仅可以减轻非事实内容的传播，还可以增加用户参与度。此外，我们在模拟中分析了流行内容的轨迹，并探讨了模拟代理人为其社交互动的表达推理是否确实与他们的集体参与模式保持一致。我们开源的仿真软件，以鼓励在AI和社会科学中进行进一步的研究。

Title: The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Authors: Michael J Bommarito II, Jillian Bommarito, Daniel Martin Katz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07854
Pdf URL: https://arxiv.org/pdf/2504.07854
Copy Paste: [[2504.07854]] The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models(https://arxiv.org/abs/2504.07854)
Keywords: language model
Abstract: Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.
摘要：实际上，所有大型语言模型均已预先培训，该数据受到与侵犯版权和违反合同有关的全球不确定性。由于这种不确定的法律地位，这为用户和开发人员造成了潜在的风险。 KL3M数据项目通过引入最大的综合培训数据管道，直接面对这一关键问题，该数据管道最大程度地减少了与版权或违反合同有关的风险。该项目的基础是超过1.32亿个文档和数万亿个代币的语料库，这些文档涵盖了16个不同来源，这些资料已经过验证，以满足本文详细介绍的严格版权和许可协议。我们正在发布整个管道，包括1）以相关的出处和元数据为原始文档格式，3）以标准化格式提取内容，4）文档的预授予的内容，以及5）各种中和后的资源，例如问答，求和，汇总，conversioniation，corversion，prafftting，confraftting，convection，and confactifection，and confectife and Traffting，confactifection，and Concection，and Concection，and Concection，and Concection，以及对话，并对话。所有这些资源都可以在S3，拥抱面孔和GitHub的情况下免费获得，并根据CC逐项提供。我们致力于继续该项目，以促进对AI模型的开发和使用更具道德，法律和可持续的方法。

Title: Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

Authors: Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin, Jiansheng Wei, Jiarui Qin, Jinpeng Li, Jun Zhao, Liqun Deng, Lin Li, Minghui Xu, Naifu Zhang, Nianzu Zheng, Qiang Li, Rongju Ruan, Shengjun Cheng, Tianyu Guo, Wei He, Wei Li, Weiwen Liu, Wulong Liu, Xinyi Dai, Yonghan Dong, Yu Pan, Yue Li, Yufei Wang, Yujun Li, Yunsheng Ni, Zhe Liu, Zhenhe Zhang, Zhicheng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07866
Pdf URL: https://arxiv.org/pdf/2504.07866
Copy Paste: [[2504.07866]] Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs(https://arxiv.org/abs/2504.07866)
Keywords: language model, llm
Abstract: We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLM has been witnessing unprecedented advances in pushing the scale and capability of LLM in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training process of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains much more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.
摘要：我们提出了Pangu ultra，这是一种大型语言模型（LLM），其中有1350亿个参数和在上升神经处理单元（NPU）训练的密集变压器模块。尽管LLM领域在近年来在推动LLM的规模和能力方面一直在见证了前所未有的进步，但培训这样的大规模模型仍然涉及重大优化和系统挑战。为了稳定训练过程，我们提出了深度缩放的三明治归一化，这在深层模型的训练过程中有效消除了损失尖峰。我们将模型预先培训13.2万亿多种和高质量的代币，并进一步增强其在培训期间的推理能力。为了有效地进行如此大规模的培训，我们利用了8,192个升至NPU，并具有一系列的系统优化。对多种基准测试的评估表明，Pangu Ultra显着提高了密集的LLM的最新能力，例如Llama 405b和Mismtral groun 2，甚至还可以通过DeepSeek-R1实现竞争结果，DeepSeek-R1的稀疏模型结构包含更多的参数。我们的探索表明，Ascend NPU能够具有超过1000亿参数的有效训练密集的模型。我们的模型和系统将为我们的商业客户提供。

Title: Token Level Routing Inference System for Edge Devices

Authors: Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric Xing, Huaxiu Yao, Qirong Ho
Subjects: cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2504.07878
Pdf URL: https://arxiv.org/pdf/2504.07878
Copy Paste: [[2504.07878]] Token Level Routing Inference System for Edge Devices(https://arxiv.org/abs/2504.07878)
Keywords: language model, llm, hallucination
Abstract: The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with under 7% of tokens generation uploaded to the large model in the cloud.
摘要：大语言模型（LLM）推断的计算复杂性显着限制了其在边缘设备上的部署效率。相比之下，小语言模型可提供更快的解码和较低的资源消耗，但经常患有降解的响应质量和对幻觉的敏感性增强。为了解决这一权衡，合作解码，其中大型模型有助于产生关键令牌，已成为一个有前途的解决方案。该范式通过通过选择性干预大型模型来实现高质量的推断，同时保持较小模型的速度和效率，从而利用了两种模型类型的优势。在这项工作中，我们提出了一种新颖的协作解码推理系统，该系统允许小型模型在选择性咨询基于云的大型模型以进行关键令牌生成的同时执行设备推理。值得注意的是，该系统仅使用M1 MacBook上的0.5B模型在CommonSensenSQA上获得60％的性能增长，而代币的7％生成上传到云中的大型模型。

Title: Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

Authors: Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07887
Pdf URL: https://arxiv.org/pdf/2504.07887
Copy Paste: [[2504.07887]] Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge(https://arxiv.org/abs/2504.07887)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. However, their increasing integration into critical societal domains has raised concerns about embedded biases, which can perpetuate stereotypes and compromise fairness. These biases stem from various sources, including historical inequalities in training data, linguistic imbalances, and adversarial manipulation. Despite mitigation efforts, recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation. Our methodology involves (i) systematically probing models with a multi-task approach targeting biases across various sociocultural dimensions, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach for automated assessment of model responses, and (iii) employing jailbreak techniques to investigate vulnerabilities in safety mechanisms. Our analysis examines prevalent biases in both small and large state-of-the-art models and their impact on model safety. Additionally, we assess the safety of domain-specific models fine-tuned for critical fields, such as medicine. Finally, we release a curated dataset of bias-related prompts, CLEAR-Bias, to facilitate systematic vulnerability benchmarking. Our findings reveal critical trade-offs between model size and safety, aiding the development of fairer and more robust future language models.
摘要：大型语言模型（LLM）彻底改变了人工智能，推动机器翻译，摘要和对话剂的进步。但是，它们不断增加与关键社会领域的融合引起了人们对嵌入式偏见的关注，这些偏见可以使刻板印象永久化并妥协公平。这些偏见源于各种来源，包括训练数据中的历史不平等，语言失衡和对抗性操纵。尽管进行了缓解措施，但最近的研究表明，LLM仍然容易受到旨在引起偏见反应的对抗攻击的影响。这项工作提出了一个可扩展的基准测试框架，以评估LLM稳健性，以抗体偏置启发。我们的方法涉及（i）采用多任务方法的系统探测模型，该模型靶向各种社会文化维度的偏见，（ii）使用LLM-AS-A-A-A-Gudge方法通过安全得分来量化鲁棒性，以自动评估模型反应的自动评估，以及（iii）采用越狱技术来调查安全机能中的透明度。我们的分析研究了小型和最新模型中的普遍偏见及其对模型安全性的影响。此外，我们评估了针对关键领域（例如医学）微调的域特异性模型的安全性。最后，我们发布了一个策划的偏见有关提示的数据集，即明确的偏见，以促进系统的漏洞基准测试。我们的发现揭示了模型规模和安全性之间的重要权衡，从而有助于开发更公平，更健壮的未来语言模型。

Title: Redefining Machine Translation on Social Network Services with Large Language Models

Authors: Hongcheng Guo, Fei Zhao, Shaosheng Cao, Xinze Lyu, Ziyan Liu, Yue Wang, Boyang Wang, Zhoujun Li, Chonggang Lu, Zhe Xu, Yao Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07901
Pdf URL: https://arxiv.org/pdf/2504.07901
Copy Paste: [[2504.07901]] Redefining Machine Translation on Social Network Services with Large Language Models(https://arxiv.org/abs/2504.07901)
Keywords: language model, llm
Abstract: The globalization of social interactions has heightened the need for machine translation (MT) on Social Network Services (SNS), yet traditional models struggle with culturally nuanced content like memes, slang, and pop culture references. While large language models (LLMs) have advanced general-purpose translation, their performance on SNS-specific content remains limited due to insufficient specialized training data and evaluation benchmarks. This paper introduces RedTrans, a 72B LLM tailored for SNS translation, trained on a novel dataset developed through three innovations: (1) Supervised Finetuning with Dual-LLM Back-Translation Sampling, an unsupervised sampling method using LLM-based back-translation to select diverse data for large-scale finetuning; (2) Rewritten Preference Optimization (RePO), an algorithm that identifies and corrects erroneous preference pairs through expert annotation, building reliable preference corpora; and (3) RedTrans-Bench, the first benchmark for SNS translation, evaluating phenomena like humor localization, emoji semantics, and meme adaptation. Experiments show RedTrans outperforms state-of-the-art LLMs. Besides, RedTrans has already been deployed in a real-world production environment, demonstrating that domain-specific adaptation, effectively bridges the gap between generic and culturally grounded translation systems.
摘要：社会互动的全球化增强了对社交网络服务（SNS）的机器翻译（MT）的需求，但是传统模型在文化上有细微的内容（例如模因，s语和流行文化参考）挣扎。尽管大型语言模型（LLMS）具有高级通用翻译，但由于专业培训数据和评估基准不足，它们在SNS特定内容上的性能仍然有限。本文介绍了RedTrans，这是一种针对SNS翻译的72B LLM，在通过三项创新开发的小说数据集中培训：（1）使用Dual-LllM背面换算采样进行监督的登录，这是一种使用LLM基于LLM的回购方法，以选择多样化的数据，以选择大型数据，以供大型列表列出大型列表，（2）重写偏好优化（repo），一种算法，通过专家注释来识别和纠正错误的偏好对，建立可靠的偏好语料库；（3）RedTrans-Bench，这是SNS翻译的第一个基准，评估了幽默定位，表情符号语义和模因适应等现象。实验表明，红色Trans的表现优于最先进的LLM。此外，RedTrans已经部署在现实世界中的生产环境中，表明特定于域的适应性有效地弥合了通用和文化基础翻译系统之间的差距。