2025-04-22

Title: MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Authors: Jaime Raldua Veuthey, Zainab Ali Majid, Suhas Hariharan, Jacob Haimes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14039
Pdf URL: https://arxiv.org/pdf/2504.14039
Copy Paste: [[2504.14039]] MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks(https://arxiv.org/abs/2504.14039)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments, quantifiable scores, and enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.

Title: A Baseline for Self-state Identification and Classification in Mental Health Data: CLPsych 2025 Task

Authors: Laerdon Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14066
Pdf URL: https://arxiv.org/pdf/2504.14066
Copy Paste: [[2504.14066]] A Baseline for Self-state Identification and Classification in Mental Health Data: CLPsych 2025 Task(https://arxiv.org/abs/2504.14066)
Keywords: language model, llm
Abstract: We present a baseline for the CLPsych 2025 A.1 task: classifying self-states in mental health data taken from Reddit. We use few-shot learning with a 4-bit quantized Gemma 2 9B model and a data preprocessing step which first identifies relevant sentences indicating self-state evidence, and then performs a binary classification to determine whether the sentence is evidence of an adaptive or maladaptive self-state. This system outperforms our other method which relies on an LLM to highlight spans of variable length independently. We attribute the performance of our model to the benefits of this sentence chunking step for two reasons: partitioning posts into sentences 1) broadly matches the granularity at which self-states were human-annotated and 2) simplifies the task for our language model to a binary classification problem. Our system places third out of fourteen systems submitted for Task A.1, achieving a test-time recall of 0.579.

Title: LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models

Authors: Kang He, Kaushik Roy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14089
Pdf URL: https://arxiv.org/pdf/2504.14089
Copy Paste: [[2504.14089]] LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models(https://arxiv.org/abs/2504.14089)
Keywords: language model, gpt, llm, chain-of-thought, tree-of-thought
Abstract: Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching for the right combination of premises at each reasoning step is inherently challenging in tasks with a large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate a caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.

Title: PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models

Authors: Nusrat Jahan Prottasha, Upama Roy Chowdhury, Shetu Mohanto, Tasfia Nuzhat, Abdullah As Sami, Md Shamol Ali, Md Shohanur Islam Sobuj, Hafijur Raman, Md Kowsher, Ozlem Ozmen Garibay
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14117
Pdf URL: https://arxiv.org/pdf/2504.14117
Copy Paste: [[2504.14117]] PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models(https://arxiv.org/abs/2504.14117)
Keywords: language model, llm
Abstract: Large models such as Large Language Models (LLMs) and Vision Language Models (VLMs) have transformed artificial intelligence, powering applications in natural language processing, computer vision, and multimodal learning. However, fully fine-tuning these models remains expensive, requiring extensive computational resources, memory, and task-specific data. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a promising solution that allows adapting large models to downstream tasks by updating only a small portion of parameters. This survey presents a comprehensive overview of PEFT techniques, focusing on their motivations, design principles, and effectiveness. We begin by analyzing the resource and accessibility challenges posed by traditional fine-tuning and highlight key issues, such as overfitting, catastrophic forgetting, and parameter inefficiency. We then introduce a structured taxonomy of PEFT methods -- grouped into additive, selective, reparameterized, hybrid, and unified frameworks -- and systematically compare their mechanisms and trade-offs. Beyond taxonomy, we explore the impact of PEFT across diverse domains, including language, vision, and generative modeling, showing how these techniques offer strong performance with lower resource costs. We also discuss important open challenges in scalability, interpretability, and robustness, and suggest future directions such as federated learning, domain adaptation, and theoretical grounding. Our goal is to provide a unified understanding of PEFT and its growing role in enabling practical, efficient, and sustainable use of large models.

Title: Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Authors: Katie Matton, Robert Osazuwa Ness, John Guttag, Emre Kıcıman
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14150
Pdf URL: https://arxiv.org/pdf/2504.14150
Copy Paste: [[2504.14150]] Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations(https://arxiv.org/abs/2504.14150)
Keywords: language model, llm
Abstract: Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.

Title: SConU: Selective Conformal Uncertainty in Large Language Models

Authors: Zhiyuan Wang, Qingni Wang, Yue Zhang, Tianlong Chen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14154
Pdf URL: https://arxiv.org/pdf/2504.14154
Copy Paste: [[2504.14154]] SConU: Selective Conformal Uncertainty in Large Language Models(https://arxiv.org/abs/2504.14154)
Keywords: language model
Abstract: As large language models are increasingly utilized in real-world applications, guarantees of task-specific metrics are essential for their reliable deployment. Previous studies have introduced various criteria of conformal uncertainty grounded in split conformal prediction, which offer user-specified correctness coverage. However, existing frameworks often fail to identify uncertainty data outliers that violate the exchangeability assumption, leading to unbounded miscoverage rates and unactionable prediction sets. In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests, by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions. Furthermore, we comprehensively analyze the components of the conformal procedures, aiming to approximate conditional coverage, particularly in high-stakes question-answering tasks.
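For orientation, the two ingredients this abstract builds on can be sketched in a few lines. This is a minimal illustration of textbook split conformal calibration and the standard conformal p-value, not the SConU procedure itself; the function names and the toy calibration scores are assumptions made for the example, and the two p-values developed in the paper may be defined differently.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points to support this alpha
    return sorted(cal_scores)[k - 1]

def conformal_p_value(cal_scores, test_score):
    """Standard conformal p-value: share of scores at least as large as the test score."""
    n = len(cal_scores)
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (1 + at_least) / (n + 1)

# Toy nonconformity scores (hypothetical); higher means "more uncertain".
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_quantile(cal, alpha=0.2)
p_inlier = conformal_p_value(cal, 0.35)   # resembles the calibration data
p_outlier = conformal_p_value(cal, 2.0)   # far outside the calibration range
```

A test sample whose p-value falls below a chosen risk level would be flagged as a likely violation of exchangeability and could be withheld rather than given a prediction set.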

Title: Self-Correction Makes LLMs Better Parsers

Authors: Ziyan Zhang, Yang Hou, Chen Gong, Zhenghua Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14165
Pdf URL: https://arxiv.org/pdf/2504.14165
Copy Paste: [[2504.14165]] Self-Correction Makes LLMs Better Parsers(https://arxiv.org/abs/2504.14165)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success across various natural language processing (NLP) tasks. However, recent studies suggest that they still face challenges in performing fundamental NLP tasks essential for deep language understanding, particularly syntactic parsing. In this paper, we conduct an in-depth analysis of LLM parsing capabilities, delving into the specific shortcomings of their parsing results. We find that these shortcomings may stem from LLMs' limited ability to fully leverage grammar rules in existing treebanks, which restricts their capability to generate valid syntactic structures. To help LLMs acquire knowledge without additional training, we propose a self-correction method that leverages grammar rules from existing treebanks to guide LLMs in correcting previous errors. Specifically, we automatically detect potential errors and dynamically search for relevant rules, offering hints and examples to guide LLMs in making corrections themselves. Experimental results on three datasets with various LLMs demonstrate that our method significantly improves performance in both in-domain and cross-domain settings on the English and Chinese datasets.

Title: Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion

Authors: Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.14175
Pdf URL: https://arxiv.org/pdf/2504.14175
Copy Paste: [[2504.14175]] Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion(https://arxiv.org/abs/2504.14175)
Keywords: language model, llm
Abstract: Query expansion methods powered by large language models (LLMs) have demonstrated effectiveness in zero-shot retrieval tasks. These methods assume that LLMs can generate hypothetical documents that, when incorporated into a query vector, enhance the retrieval of real evidence. However, we challenge this assumption by investigating whether knowledge leakage in benchmarks contributes to the observed performance gains. Using fact verification as a testbed, we analyzed whether the generated documents contained information entailed by ground truth evidence and assessed their impact on performance. Our findings indicate that performance improvements occurred consistently only for claims whose generated documents included sentences entailed by ground truth evidence. This suggests that knowledge leakage may be present in these benchmarks, inflating the perceived performance of LLM-based query expansion methods, particularly in real-world scenarios that require retrieving niche or novel knowledge.

Title: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

Authors: Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14194
Pdf URL: https://arxiv.org/pdf/2504.14194
Copy Paste: [[2504.14194]] Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models(https://arxiv.org/abs/2504.14194)
Keywords: language model, llm
Abstract: The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.

Title: Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

Authors: Takuma Udagawa, Yang Zhao, Hiroshi Kanayama, Bishwaranjan Bhattacharjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14212
Pdf URL: https://arxiv.org/pdf/2504.14212
Copy Paste: [[2504.14212]] Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification(https://arxiv.org/abs/2504.14212)
Keywords: language model, llm
Abstract: Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data, which consist mainly of web-crawled texts, contain undesirable social biases that can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

Title: Understanding the Repeat Curse in Large Language Models from a Feature Perspective

Authors: Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14218
Pdf URL: https://arxiv.org/pdf/2504.14218
Copy Paste: [[2504.14218]] Understanding the Repeat Curse in Large Language Models from a Feature Perspective(https://arxiv.org/abs/2504.14218)
Keywords: language model, llm
Abstract: Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the "Repeat Curse". While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, "Duplicatus Charm", to induce and analyze the Repeat Curse. Our method systematically identifies "Repetition Features", the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token- and paragraph-level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we have effectively mitigated the Repeat Curse.

Title: SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification

Authors: Michael Färber, Parisa Aghdam, Kyuri Im, Mario Tawfelis, Hardik Ghoshal
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.14223
Pdf URL: https://arxiv.org/pdf/2504.14223
Copy Paste: [[2504.14223]] SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification(https://arxiv.org/abs/2504.14223)
Keywords: language model, gpt, llm
Abstract: Text simplification is essential for making complex content accessible to diverse audiences who face comprehension challenges. Yet, the limited availability of simplified materials creates significant barriers to personal and professional growth and hinders social inclusion. Although researchers have explored various methods for automatic text simplification, none fully leverage large language models (LLMs) to offer tailored customization for different target groups and varying levels of simplicity. Moreover, despite its proven benefits for both consumers and organizations, the well-established practice of plain language remains underutilized. In this paper, we present SimplifyMyText, the first system designed to produce plain language content from multiple input formats, including typed text and file uploads, with flexible customization options for diverse audiences. We employ GPT-4 and Llama-3 and evaluate outputs across multiple metrics. Overall, our work contributes to research on automatic text simplification and highlights the importance of tailored communication in promoting inclusivity.

Title: Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Authors: Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14225
Pdf URL: https://arxiv.org/pdf/2504.14225
Copy Paste: [[2504.14225]] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale(https://arxiv.org/abs/2504.14225)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain on how well LLMs today can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user profiling and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e. query issued by the user from the first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at this http URL.

Title: Probing the Subtle Ideological Manipulation of Large Language Models

Authors: Demetris Paschalides, George Pallis, Marios D. Dikaiakos
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14287
Pdf URL: https://arxiv.org/pdf/2504.14287
Copy Paste: [[2504.14287]] Probing the Subtle Ideological Manipulation of Large Language Models(https://arxiv.org/abs/2504.14287)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have transformed natural language processing, but concerns have emerged about their susceptibility to ideological manipulation, particularly in politically sensitive areas. Prior work has focused on binary Left-Right LLM biases, using explicit prompts and fine-tuning on political QA datasets. In this work, we move beyond this binary approach to explore the extent to which LLMs can be influenced across a spectrum of political ideologies, from Progressive-Left to Conservative-Right. We introduce a novel multi-task dataset designed to reflect diverse ideological positions through tasks such as ideological QA, statement ranking, manifesto cloze completion, and Congress bill comprehension. By fine-tuning three LLMs (Phi-2, Mistral, and Llama-3) on this dataset, we evaluate their capacity to adopt and express these nuanced ideologies. Our findings indicate that fine-tuning significantly enhances nuanced ideological alignment, while explicit prompts provide only minor refinements. This highlights the models' susceptibility to subtle ideological manipulation, suggesting a need for more robust safeguards to mitigate these risks.

Title: Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Authors: Patrick Haller, Jonas Golde, Alan Akbik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14366
Pdf URL: https://arxiv.org/pdf/2504.14366
Copy Paste: [[2504.14366]] Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models(https://arxiv.org/abs/2504.14366)
Keywords: language model, llm
Abstract: Knowledge distillation is a widely used technique for compressing large language models (LLMs) by training a smaller student model to mimic a larger teacher model. Typically, both the teacher and student are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention at inference time remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher to nine subquadratic student architectures. Our study aims to determine which subquadratic model best aligns with the teacher's learned representations and how different architectural constraints influence the distillation process. We also investigate the impact of intelligent initialization strategies, including matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.
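As background for the teacher-student setup described above, the classic soft-label distillation objective can be written in a few lines. This is a minimal sketch of the widely used temperature-scaled KL loss, offered as an assumption about what "mimicking the teacher" typically means; the abstract does not state which loss the authors use, and the function names here are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Toy next-token logits over a 4-word vocabulary (hypothetical values).
teacher = [2.0, 1.0, 0.1, -1.0]
matched_student = [2.0, 1.0, 0.1, -1.0]
mismatched_student = [-1.0, 0.1, 1.0, 2.0]
loss_matched = distillation_loss(teacher, matched_student)
loss_mismatched = distillation_loss(teacher, mismatched_student)
```

Because the loss only compares output distributions, it is agnostic to whether the student is a Transformer, a state-space model, or a recurrent network, which is what makes cross-architecture distillation possible in principle.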

Title: Diverse Prompts: Illuminating the Prompt Space of Large Language Models with MAP-Elites

Authors: Gabriel Machado Santos, Rita Maria da Silva Julia, Marcelo Zanchetta do Nascimento
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14367
Pdf URL: https://arxiv.org/pdf/2504.14367
Copy Paste: [[2504.14367]] Diverse Prompts: Illuminating the Prompt Space of Large Language Models with MAP-Elites(https://arxiv.org/abs/2504.14367)
Keywords: language model, llm, prompt
Abstract: Prompt engineering is essential for optimizing large language models (LLMs), yet the link between prompt structures and task performance remains underexplored. This work introduces an evolutionary approach that combines context-free grammar (CFG) with the MAP-Elites algorithm to systematically explore the prompt space. Our method prioritizes quality and diversity, generating high-performing and structurally varied prompts while analyzing their alignment with diverse tasks by varying traits such as the number of examples (shots) and reasoning depth. By systematically mapping the phenotypic space, we reveal how structural variations influence LLM performance, offering actionable insights for task-specific and adaptable prompt design. Evaluated on seven BigBench Lite tasks across multiple LLMs, our results underscore the critical interplay of quality and diversity, advancing the effectiveness and versatility of LLMs.
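To make the MAP-Elites idea concrete, here is a minimal sketch of the archive loop on a toy prompt space. Everything specific is an assumption for illustration: the genome layout, the descriptor ranges, and `toy_fitness`, which stands in for a real LLM evaluation. The paper's grammar-based prompt generation is not reproduced here.

```python
import random

# Hypothetical per-template quality bonus; a real system would measure task accuracy.
TEMPLATE_BONUS = [0.0, 0.5, 1.0, 0.25]

def toy_fitness(genome):
    """Stand-in score for a prompt described by (shots, depth, template)."""
    shots, depth, template = genome
    return -float((shots - 3) ** 2 + (depth - 2) ** 2) + TEMPLATE_BONUS[template]

def mutate(genome, rng):
    """Nudge each gene by at most one step, staying inside its range."""
    shots, depth, template = genome
    shots = min(5, max(0, shots + rng.choice([-1, 0, 1])))
    depth = min(3, max(0, depth + rng.choice([-1, 0, 1])))
    template = (template + rng.choice([-1, 0, 1])) % len(TEMPLATE_BONUS)
    return (shots, depth, template)

def map_elites(iterations=300, seed=0):
    """Keep the best genome found for each (shots, depth) cell."""
    rng = random.Random(seed)
    archive = {}  # descriptor cell -> (fitness, genome)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            child = (rng.randint(0, 5), rng.randint(0, 3), rng.randrange(4))
        cell = (child[0], child[1])  # behavior descriptor: shots and reasoning depth
        fit = toy_fitness(child)
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, child)
    return archive

archive = map_elites()
```

The archive is the result of interest: rather than a single best prompt, it holds one elite per combination of shot count and reasoning depth, so quality can be compared across structurally different prompts.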
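To make the MAP-Elites idea concrete, here is a minimal sketch of the archive loop on a toy problem. Everything specific is an assumption for illustration: a "prompt" is reduced to a tuple `(shots, depth, verbosity)`, the behavior descriptor is `(shots, depth)`, and `toy_fitness` stands in for a real LLM evaluation. The paper's context-free grammar for generating prompts is not reproduced here.

```python
import random

def map_elites(evaluate, describe, random_solution, mutate,
               iterations=500, n_init=20, seed=0):
    """Keep the best solution found so far in each behavior cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, solution)

    def try_insert(solution):
        cell = describe(solution)
        fitness = evaluate(solution)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, solution)

    for _ in range(n_init):            # random initialization
        try_insert(random_solution(rng))
    for _ in range(iterations):        # mutate a randomly chosen elite
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive

# Hypothetical toy problem: (number of shots, reasoning depth, verbosity).
def random_prompt(rng):
    return (rng.randint(0, 3), rng.randint(0, 2), rng.random())

def mutate_prompt(prompt, rng):
    shots, depth, verbosity = prompt
    field = rng.randrange(3)
    if field == 0:
        shots = min(3, max(0, shots + rng.choice([-1, 1])))
    elif field == 1:
        depth = min(2, max(0, depth + rng.choice([-1, 1])))
    else:
        verbosity = min(1.0, max(0.0, verbosity + rng.uniform(-0.1, 0.1)))
    return (shots, depth, verbosity)

def toy_fitness(prompt):
    # Placeholder for task accuracy measured by querying an LLM.
    return 1.0 - abs(prompt[2] - 0.5)

def descriptor(prompt):
    return (prompt[0], prompt[1])

archive = map_elites(toy_fitness, descriptor, random_prompt, mutate_prompt)
```

The resulting `archive` holds one elite per `(shots, depth)` cell, which is what allows quality to be compared across structurally different prompts rather than only at a single optimum.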

Title: ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Authors: Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14452
Pdf URL: https://arxiv.org/pdf/2504.14452
Copy Paste: [[2504.14452]] ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data(https://arxiv.org/abs/2504.14452)
Keywords: language model, prompt
Abstract: Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).

Title: CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

Authors: Armin Toroghi, Willis Guo, Scott Sanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14462
Pdf URL: https://arxiv.org/pdf/2504.14462
Copy Paste: [[2504.14462]] CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge(https://arxiv.org/abs/2504.14462)
Keywords: language model, llm, hallucination
Abstract: The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge, and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), that consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset since the support of knowledge required to answer its queries is present in the Wikidata knowledge graph. However, as opposed to existing KGQA benchmarks that merely focus on factoid questions, our CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.

Title: DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

Authors: Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2504.14482
Pdf URL: https://arxiv.org/pdf/2504.14482
Copy Paste: [[2504.14482]] DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue(https://arxiv.org/abs/2504.14482)
Keywords: agent
Abstract: Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on speech review, boosting emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at this https URL to facilitate future research on advanced speech synthesis models and customized data generation.

Title: FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Authors: Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14492
Pdf URL: https://arxiv.org/pdf/2504.14492
Copy Paste: [[2504.14492]] FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering(https://arxiv.org/abs/2504.14492)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are prone to capturing biases from their training corpora, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework that requires neither customized prompt design nor model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.
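The three steps named in the abstract (detect, compute a steering vector, steer) can be sketched on plain Python lists. This is a toy illustration under stated assumptions: the DSV is taken to be the mean difference between unbiased and biased activations, the detector is a given linear probe, and `alpha` is a hypothetical steering strength. None of these details are confirmed by the abstract.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def compute_dsv(biased_acts, unbiased_acts):
    """Assumed DSV: mean(unbiased) - mean(biased), per dimension."""
    dim = len(biased_acts[0])
    mean_b = [sum(a[i] for a in biased_acts) / len(biased_acts) for i in range(dim)]
    mean_u = [sum(a[i] for a in unbiased_acts) / len(unbiased_acts) for i in range(dim)]
    return [u - b for u, b in zip(mean_u, mean_b)]

def is_biased(activation, probe_weights, probe_bias):
    """Linear probe: a positive score means a bias signature was detected."""
    return dot(probe_weights, activation) + probe_bias > 0

def steer(activation, dsv, probe_weights, probe_bias, alpha=1.0):
    """Add the steering vector only when the probe flags the activation."""
    if is_biased(activation, probe_weights, probe_bias):
        return [a + alpha * d for a, d in zip(activation, dsv)]
    return list(activation)

# Toy 2-D example with hand-picked activations and probe.
dsv = compute_dsv([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
steered = steer([1.0, 0.0], dsv, [1.0, -1.0], 0.0)
untouched = steer([0.0, 1.0], dsv, [1.0, -1.0], 0.0)
```

Making the intervention conditional on the probe is what makes the steering "dynamic": activations that the probe does not flag pass through unchanged.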

Title: Functional Abstraction of Knowledge Recall in Large Language Models

Authors: Zijian Wang, Chang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14496
Pdf URL: https://arxiv.org/pdf/2504.14496
Copy Paste: [[2504.14496]] Functional Abstraction of Knowledge Recall in Large Language Models(https://arxiv.org/abs/2504.14496)
Keywords: language model, llm, prompt
Abstract: Pre-trained transformer large language models (LLMs) demonstrate strong knowledge recall capabilities. This paper investigates the knowledge recall mechanism in LLMs by abstracting it into a functional structure. We propose that during knowledge recall, the model's hidden activation space implicitly entails a function execution process where specific activation vectors align with functional components (Input argument, Function body, and Return values). Specifically, activation vectors of relation-related tokens define a mapping function from subjects to objects, with subject-related token activations serving as input arguments and object-related token activations as return values. For experimental verification, we first design a patching-based knowledge-scoring algorithm to identify knowledge-aware activation vectors as independent functional components. Then, we conduct counter-knowledge testing to examine the independent functional effects of each component on knowledge recall outcomes. From this functional perspective, we improve the contextual knowledge editing approach augmented by activation patching. By rewriting incoherent activations in context, we enable improved short-term memory retention for new knowledge prompting.

Title: Causality for Natural Language Processing

Authors: Zhijing Jin
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14530
Pdf URL: https://arxiv.org/pdf/2504.14530
Copy Paste: [[2504.14530]] Causality for Natural Language Processing(https://arxiv.org/abs/2504.14530)
Keywords: language model, llm
Abstract: Causal reasoning is a cornerstone of human intelligence and a critical capability for artificial systems aiming to achieve advanced understanding and decision-making. This thesis delves into various dimensions of causal reasoning and understanding in large language models (LLMs). It encompasses a series of studies that explore the causal inference skills of LLMs, the mechanisms behind their performance, and the implications of causal and anticausal learning for natural language processing (NLP) tasks. Additionally, it investigates the application of causal reasoning in text-based computational social science, specifically focusing on political decision-making and the evaluation of scientific impact through citations. Through novel datasets, benchmark tasks, and methodological frameworks, this work identifies key challenges and opportunities to improve the causal capabilities of LLMs, providing a comprehensive foundation for future research in this evolving field.

Title: BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Authors: Yiting Ran, Xintao Wang, Tian Qiu, Jiaqing Liang, Yanghua Xiao, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14538
Pdf URL: https://arxiv.org/pdf/2504.14538
Copy Paste: [[2504.14538]] BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation(https://arxiv.org/abs/2504.14538)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents newly defined personas. However, simulating established fictional worlds and characters remains largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld's design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, etc. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code of this paper can be found at the project page: this https URL.

Title: a1: Steep Test-time Scaling Law via Environment Augmented Generation

Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14597
Pdf URL: https://arxiv.org/pdf/2504.14597
Copy Paste: [[2504.14597]] a1: Steep Test-time Scaling Law via Environment Augmented Generation(https://arxiv.org/abs/2504.14597)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG's distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity. EAG's theoretical framework demonstrates how environment interactivity and systematic branch exploration together establish a new paradigm for reliable machine reasoning, particularly for problems requiring precise multi-step calculation and logical verification.

Title: Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations

Authors: Yuri Balashov, Alex Balashov, Shiho Fukuda Koski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14619
Pdf URL: https://arxiv.org/pdf/2504.14619
Copy Paste: [[2504.14619]] Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations(https://arxiv.org/abs/2504.14619)
Keywords: language model
Abstract: This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies for individual translators and language service providers with modest resources. The advent of advanced neural machine translation systems, large language models, and their integration into workflows via computer-assisted translation tools and translation management systems have reshaped the translation landscape. These advancements enable not only translation but also quality evaluation, error spotting, glossary generation, and adaptation to domain-specific needs, creating new technical opportunities for freelancers. In this series, we aim to empower translators with actionable methods to harness these advancements. Our approach emphasizes Translation Analytics, a suite of evaluation techniques traditionally reserved for large-scale industry applications but now becoming increasingly available for smaller-scale users. This first paper introduces a practical framework for adapting automatic evaluation metrics -- such as BLEU, chrF, TER, and COMET -- to freelancers' needs. We illustrate the potential of these metrics using a trilingual corpus derived from a real-world project in the medical domain and provide statistical analysis correlating human evaluations with automatic scores. Our findings emphasize the importance of proactive engagement with emerging technologies to not only adapt but thrive in the evolving professional environment.

Title: A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models

Authors: Hongming Tan, Shaoxiong Zhan, Fengwei Jia, Hai-Tao Zheng, Wai Kin Chan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14620
Pdf URL: https://arxiv.org/pdf/2504.14620
Copy Paste: [[2504.14620]] A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models(https://arxiv.org/abs/2504.14620)
Keywords: language model, llm, prompt
Abstract: Measuring scientific paper innovation is both important and challenging. Existing content-based methods often overlook the full-paper context, fail to capture the full scope of innovation, and lack generalization. We propose HSPIM, a hierarchical and training-free framework based on large language models (LLMs). It introduces a Paper-to-Sections-to-QAs decomposition to assess innovation. We segment the text by section titles and use zero-shot LLM prompting to implement section classification, question-answering (QA) augmentation, and weighted novelty scoring. The generated QA pair focuses on section-level innovation and serves as additional context to improve the LLM scoring. For each chunk, the LLM outputs a novelty score and a confidence score. We use confidence scores as weights to aggregate novelty scores into a paper-level innovation score. To further improve performance, we propose a two-layer question structure consisting of common and section-specific questions, and apply a genetic algorithm to optimize the question-prompt combinations. Comprehensive experiments on scientific conference paper datasets show that HSPIM outperforms baseline methods in effectiveness, generalization, and interpretability.
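The aggregation step described in the abstract (confidence scores used as weights over per-chunk novelty scores) can be illustrated with a short sketch. The normalized weighted mean used here is an assumption; the abstract does not give the exact formula, and `paper_innovation_score` is a hypothetical name.

```python
def paper_innovation_score(chunks):
    """Confidence-weighted mean of novelty scores.

    `chunks` is a list of (novelty_score, confidence) pairs, one per
    text chunk. Assumption: weights are normalized by their sum.
    """
    total_weight = sum(conf for _, conf in chunks)
    if total_weight == 0:
        raise ValueError("at least one chunk needs non-zero confidence")
    return sum(score * conf for score, conf in chunks) / total_weight

# Equal confidence gives a plain average; zero confidence drops a chunk.
balanced = paper_innovation_score([(8.0, 0.5), (4.0, 0.5)])
skewed = paper_innovation_score([(10.0, 1.0), (0.0, 0.0)])
```

Under this scheme, a chunk the LLM is unsure about contributes little to the paper-level score, regardless of how high or low its novelty score is.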

Title: Automatic Text Summarization (ATS) for Research Documents in Sorani Kurdish

Authors: Rondik Hadi Abdulrahman, Hossein Hassani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14630
Pdf URL: https://arxiv.org/pdf/2504.14630
Copy Paste: [[2504.14630]] Automatic Text Summarization (ATS) for Research Documents in Sorani Kurdish(https://arxiv.org/abs/2504.14630)
Keywords: language model
Abstract: Extracting concise information from scientific documents aids learners, researchers, and practitioners. Automatic Text Summarization (ATS), a key Natural Language Processing (NLP) application, automates this process. While ATS methods exist for many languages, Kurdish remains underdeveloped due to limited resources. This study develops a dataset and language model based on 231 scientific papers in Sorani Kurdish, collected from four academic departments in two universities in the Kurdistan Region of Iraq (KRI), averaging 26 pages per document. Using Sentence Weighting and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms, two experiments were conducted, differing in whether the conclusions were included. The average word count was 5,492.3 in the first experiment and 5,266.96 in the second. Results were evaluated manually and automatically using ROUGE-1, ROUGE-2, and ROUGE-L metrics, with the best accuracy reaching 19.58%. Six experts conducted manual evaluations using three criteria, with results varying by document. This research provides valuable resources for Kurdish NLP researchers to advance ATS and related fields.
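As a rough illustration of TF-IDF-based extractive summarization, the sketch below scores each sentence by the sum of its terms' TF-IDF weights and keeps the top `k` in their original order. The assumptions are significant: each sentence is treated as its own "document" for the IDF computation, tokenization is a plain whitespace split (Sorani Kurdish would need proper tokenization), and the study's sentence-weighting scheme is not reproduced.

```python
import math
from collections import Counter

def summarize(sentences, k=2):
    """Return the k highest-scoring sentences, in original order."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for toks in tokenized for word in set(toks))

    def score(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        return sum((count / len(toks)) * math.log(n / df[word])
                   for word, count in tf.items())

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]

sentences = [
    "the cat sat",
    "the cat sat",
    "quantum entanglement experiment",
]
top_one = summarize(sentences, k=1)
top_two = summarize(sentences, k=2)
```

Sentences made of words that recur across the text score low, so the repeated sentence loses to the one with distinctive terms.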

Title: Harnessing Generative LLMs for Enhanced Financial Event Entity Extraction Performance

Authors: Soo-joon Choi, Ji-jun Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14633
Pdf URL: https://arxiv.org/pdf/2504.14633
Copy Paste: [[2504.14633]] Harnessing Generative LLMs for Enhanced Financial Event Entity Extraction Performance(https://arxiv.org/abs/2504.14633)
Keywords: language model, llm
Abstract: Financial event entity extraction is a crucial task for analyzing market dynamics and building financial knowledge graphs, yet it presents significant challenges due to the specialized language and complex structures in financial texts. Traditional approaches often rely on sequence labeling models, which can struggle with long-range dependencies and the inherent complexity of extracting multiple, potentially overlapping entities. Motivated by the advanced language understanding and generative capabilities of Large Language Models (LLMs), we propose a novel method that reframes financial event entity extraction as a text-to-structured-output generation task. Our approach involves fine-tuning a pre-trained LLM using Parameter-Efficient Fine-Tuning (PEFT) to directly generate a structured representation, such as a JSON object, containing the extracted entities and their precise character spans from the input text. We evaluate our method on the challenging CCKS 2019 Financial Event Entity Extraction dataset, comparing its performance against strong sequence labeling baselines, including SEBERTNets and sebertNets. Experimental results demonstrate that our generative LLM method achieves a new state-of-the-art F1 score on this benchmark, significantly outperforming previous methods. Through detailed quantitative analysis across event types, entity types, and instance complexity, as well as human evaluation, we show that our approach is more effective at handling the nuances of financial text and extracting high-quality entities. This work validates the potential of applying generative LLMs directly to complex, domain-specific information extraction tasks requiring structured output.

Title: A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs

Authors: Yihan Lin, Zhirong Bella Yu, Simon Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14657
Pdf URL: https://arxiv.org/pdf/2504.14657
Copy Paste: [[2504.14657]] A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs(https://arxiv.org/abs/2504.14657)
Keywords: language model, llm
Abstract: Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy-preserving and harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without concerns about compromising real individuals' privacy. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long-standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify areas where these models excel and where they fall short. Our main finding from this work is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.

Title: Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data

Authors: Wei Zou, Sen Yang, Yu Bao, Shujian Huang, Jiajun Chen, Shanbo Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14669
Pdf URL: https://arxiv.org/pdf/2504.14669
Copy Paste: [[2504.14669]] Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data(https://arxiv.org/abs/2504.14669)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) has reshaped machine translation (MT), but multilingual MT still relies heavily on parallel data for supervised fine-tuning (SFT), facing challenges like data scarcity for low-resource languages and catastrophic forgetting. To address these issues, we propose TRANS-ZERO, a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of LLMs. TRANS-ZERO combines Genetic Monte-Carlo Tree Search (G-MCTS) with preference optimization, achieving strong translation performance that rivals supervised methods. Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions. Further analysis reveals that G-MCTS itself significantly enhances translation quality by exploring semantically consistent candidates through iterative translations, providing a robust foundation for the framework's success.

Title: FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models

Authors: Mehrnoush Shamsfard, Zahra Saaberi, Mostafa Karimi manesh, Seyed Mohammad Hossein Hashemi, Zahra Vatankhah, Motahareh Ramezani, Niki Pourazin, Tara Zare, Maryam Azimi, Sarina Chitsaz, Sama Khoraminejad, Morteza Mahdavi Mortazavi, Mohammad Mahdi Chizari, Sahar Maleki, Seyed Soroush Majd, Mostafa Masumi, Sayed Ali Musavi Khoeini, Amir Mohseni, Sogol Alipour
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14690
Pdf URL: https://arxiv.org/pdf/2504.14690
Copy Paste: [[2504.14690]] FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models(https://arxiv.org/abs/2504.14690)
Keywords: language model, llm
Abstract: Research on evaluating and analyzing large language models (LLMs) has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces FarsEval-PKBETS benchmark, a subset of FarsEval project for evaluating large language models in Persian. This benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer and descriptive responses. It covers a wide range of domains and tasks,including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. This bechmark incorporates linguistics, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current language models are still far from being able to solve this benchmark

Title: OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

Authors: Songtao Jiang, Yuan Wang, Sibo Song, Yan Zhang, Zijie Meng, Bohan Lei, Jian Wu, Jimeng Sun, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14692
Pdf URL: https://arxiv.org/pdf/2504.14692
Copy Paste: [[2504.14692]] OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding(https://arxiv.org/abs/2504.14692)
Keywords: language model
Abstract: The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.

Title: PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

Authors: Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14738
Pdf URL: https://arxiv.org/pdf/2504.14738
Copy Paste: [[2504.14738]] PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines(https://arxiv.org/abs/2504.14738)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.

Title: Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings

Authors: Saniya Karwa, Navpreet Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14766
Pdf URL: https://arxiv.org/pdf/2504.14766
Copy Paste: [[2504.14766]] Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings(https://arxiv.org/abs/2504.14766)
Keywords: language model
Abstract: Understanding the inner workings of neural embeddings, particularly in models such as BERT, remains a challenge because of their high-dimensional and opaque nature. This paper proposes a framework for uncovering the specific dimensions of vector embeddings that encode distinct linguistic properties (LPs). We introduce the Linguistically Distinct Sentence Pairs (LDSP-10) dataset, which isolates ten key linguistic features such as synonymy, negation, tense, and quantity. Using this dataset, we analyze BERT embeddings with various methods, including the Wilcoxon signed-rank test, mutual information, and recursive feature elimination, to identify the most influential dimensions for each LP. We introduce a new metric, the Embedding Dimension Impact (EDI) score, which quantifies the relevance of each embedding dimension to an LP. Our findings show that certain properties, such as negation and polarity, are robustly encoded in specific dimensions, while others, like synonymy, exhibit more complex patterns. This study provides insights into the interpretability of embeddings, which can guide the development of more transparent and optimized language models, with implications for model bias mitigation and the responsible deployment of AI systems.

Title: Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Authors: Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, Ping Ma
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14772
Pdf URL: https://arxiv.org/pdf/2504.14772
Copy Paste: [[2504.14772]] Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions(https://arxiv.org/abs/2504.14772)
Keywords: language model, llm
Abstract: The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

Title: Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends

Authors: Jiaxin GUO, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Zongyao Li, Hengchao Shang, Daimeng Wei, Hao Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14804
Pdf URL: https://arxiv.org/pdf/2504.14804
Copy Paste: [[2504.14804]] Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends(https://arxiv.org/abs/2504.14804)
Keywords: language model, llm
Abstract: With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs) that have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including evaluation methods with and without reference texts, as well as traditional metrics, model-based metrics and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to the future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.

Title: On Self-improving Token Embeddings

Authors: Mario M. Kubek, Shiraj Pokharel, Thomas Böhme, Emma L. McDaniel, Herwig Unger, Armin R. Mikler
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.14808
Pdf URL: https://arxiv.org/pdf/2504.14808
Copy Paste: [[2504.14808]] On Self-improving Token Embeddings(https://arxiv.org/abs/2504.14808)
Keywords: language model
Abstract: This article introduces a novel and fast method for refining pre-trained static word or, more generally, token embeddings. By incorporating the embeddings of neighboring tokens in text corpora, it continuously updates the representation of each token, including those without pre-assigned embeddings. This approach effectively addresses the out-of-vocabulary problem, too. Operating independently of large language models and shallow neural networks, it enables versatile applications such as corpus exploration, conceptual search, and word sense disambiguation. The method is designed to enhance token representations within topically homogeneous corpora, where the vocabulary is restricted to a specific domain, resulting in more meaningful embeddings compared to general-purpose pre-trained vectors. As an example, the methodology is applied to explore storm events and their impacts on infrastructure and communities using narratives from a subset of the NOAA Storm Events database. The article also demonstrates how the approach improves the representation of storm-related terms over time, providing valuable insights into the evolving nature of disaster narratives.

Title: Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation

Authors: Jiajun Shen, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14856
Pdf URL: https://arxiv.org/pdf/2504.14856
Copy Paste: [[2504.14856]] Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation(https://arxiv.org/abs/2504.14856)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: While hallucinations of large language models could be alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce the Context-Prior Augmented Citation Generation task, requiring models to generate citations considering both external and internal knowledge while providing trustworthy references, with 5 evaluation metrics focusing on 3 aspects: answer helpfulness, citation faithfulness, and trustworthiness. We introduce RAEL, the paradigm for our task, and also design INTRALIGN, an integrated method containing customary data generation and an alignment algorithm. Our experimental results show that our method achieves better cross-scenario performance than other baselines. Our extended experiments further reveal that retrieval quality, question types, and model knowledge have considerable influence on the trustworthiness in citation generation.

Title: Natural Fingerprints of Large Language Models

Authors: Teppei Suzuki, Ryokan Ri, Sho Takase
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14871
Pdf URL: https://arxiv.org/pdf/2504.14871
Copy Paste: [[2504.14871]] Natural Fingerprints of Large Language Models(https://arxiv.org/abs/2504.14871)
Keywords: language model, llm
Abstract: Large language models (LLMs) often exhibit biases -- systematic deviations from expected norms -- in their outputs. These range from overt issues, such as unfair responses, to subtler patterns that can reveal which model produced them. We investigate the factors that give rise to identifiable characteristics in LLMs. Since LLMs model the training data distribution, it is reasonable that differences in training data naturally lead to these characteristics. However, our findings reveal that even when LLMs are trained on the exact same data, it is still possible to distinguish the source model based on its generated text. We refer to these unintended, distinctive characteristics as natural fingerprints. By systematically controlling training conditions, we show that natural fingerprints can emerge from subtle differences in the training process, such as parameter sizes, optimization settings, and even random seeds. We believe that understanding natural fingerprints offers new insights into the origins of unintended bias and ways to improve control over LLM behavior.

Title: Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

Authors: Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14891
Pdf URL: https://arxiv.org/pdf/2504.14891
Copy Paste: [[2504.14891]] Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey(https://arxiv.org/abs/2504.14891)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches, for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey for RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.

Title: CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs

Authors: Yingming Zheng, Xiaoliang Liu, Peng Wu, Li Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14905
Pdf URL: https://arxiv.org/pdf/2504.14905
Copy Paste: [[2504.14905]] CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs(https://arxiv.org/abs/2504.14905)
Keywords: language model, llm
Abstract: The rapid spread of misinformation, driven by digital media and AI-generated content, has made automatic claim verification essential. Traditional methods, which depend on expert-annotated evidence, are labor-intensive and not scalable. Although recent automated systems have improved, they still struggle with complex claims that require nuanced reasoning. To address this, we propose CRAVE, a Conflicting Reasoning Approach for explainable claim VErification, that verifies complex claims based on the conflicting rationales reasoned by large language models (LLMs). Specifically, CRAVE introduces a three-module framework. The Ambiguity Elimination-enhanced Evidence Retrieval module performs ambiguity elimination and entity-based search to gather relevant evidence related to claim verification from external sources like Wikipedia. The Conflicting Perspective Reasoning and Preliminary Judgment module uses LLMs to reason out rationales with conflicting stances about claim verification from the retrieved evidence across four dimensions, i.e., direct evidence, semantic relationships, linguistic patterns, and logical reasoning, and makes a preliminary judgment. Finally, the Small Language Model (SLM)-based Judge module is fine-tuned to make use of the preliminary judgment from the LLMs to assess the confidence of the conflicting rationales and make a final authenticity judgment. This methodology allows CRAVE to capture subtle inconsistencies in complex claims, improving both the accuracy and transparency of claim verification. Extensive experiments on two public claim verification datasets demonstrate that our CRAVE model achieves much better performance than state-of-the-art methods and exhibits a superior capacity for finding relevant evidence and explaining the model predictions. The code is provided at this https URL.

Title: Evaluating LLMs on Chinese Topic Constructions: A Research Proposal Inspired by Tian et al. (2024)

Authors: Xiaodong Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14969
Pdf URL: https://arxiv.org/pdf/2504.14969
Copy Paste: [[2504.14969]] Evaluating LLMs on Chinese Topic Constructions: A Research Proposal Inspired by Tian et al. (2024)(https://arxiv.org/abs/2504.14969)
Keywords: language model, llm
Abstract: This paper proposes a framework for evaluating large language models (LLMs) on Chinese topic constructions, focusing on their sensitivity to island constraints. Drawing inspiration from Tian et al. (2024), we outline an experimental design for testing LLMs' grammatical knowledge of Mandarin syntax. While no experiments have been conducted yet, this proposal aims to provide a foundation for future studies and invites feedback on the methodology.

Title: Efficient Pretraining Length Scaling

Authors: Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.14992
Pdf URL: https://arxiv.org/pdf/2504.14992
Copy Paste: [[2504.14992]] Efficient Pretraining Length Scaling(https://arxiv.org/abs/2504.14992)
Keywords: language model
Abstract: Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
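The cache bookkeeping described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function name `phd_style_decode`, the `num_hidden` parameter, and the assumption that a fixed number of hidden decoding tokens follows each original token are all hypothetical, and no attention is actually computed. It only shows that the persistent cache grows with the original tokens while hidden-token entries are dropped after use.

```python
def phd_style_decode(tokens, num_hidden=2):
    """Toy bookkeeping: return (persistent_cache, hidden_steps_run).

    Assumption (not from the paper): each original token is followed by
    `num_hidden` hidden decoding tokens whose KV entries are transient.
    """
    persistent_cache = []  # KV entries kept for long-range dependencies
    hidden_steps = 0
    for tok in tokens:
        # The original token's KV entry is retained.
        persistent_cache.append(("kv", tok))
        # Hidden decoding tokens get KV entries only for the current step.
        transient = []
        for i in range(num_hidden):
            transient.append(("kv_hidden", tok, i))
            hidden_steps += 1
        # Discard hidden-token KV entries immediately after use.
        transient.clear()
    return persistent_cache, hidden_steps


cache, steps = phd_style_decode(["a", "b", "c"], num_hidden=2)
print(len(cache), steps)  # 3 6
```

In this sketch the persistent cache holds three entries for three original tokens, the same size a vanilla decoder would hold, even though six extra hidden decoding steps were run.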

Title: Stay Hungry, Stay Foolish: On the Extended Reading Articles Generation with LLMs

Authors: Yow-Fu Liou, Yu-Chien Tang, An-Zi Yen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15013
Pdf URL: https://arxiv.org/pdf/2504.15013
Copy Paste: [[2504.15013]] Stay Hungry, Stay Foolish: On the Extended Reading Articles Generation with LLMs(https://arxiv.org/abs/2504.15013)
Keywords: language model, llm
Abstract: The process of creating educational materials is both time-consuming and demanding for educators. This research explores the potential of Large Language Models (LLMs) to streamline this task by automating the generation of extended reading materials and relevant course suggestions. Using the TED-Ed Dig Deeper sections as an initial exploration, we investigate how supplementary articles can be enriched with contextual knowledge and connected to additional learning resources. Our method begins by generating extended articles from video transcripts, leveraging LLMs to include historical insights, cultural examples, and illustrative anecdotes. A recommendation system employing semantic similarity ranking identifies related courses, followed by an LLM-based refinement process to enhance relevance. The final articles are tailored to seamlessly integrate these recommendations, ensuring they remain cohesive and informative. Experimental evaluations demonstrate that our model produces high-quality content and accurate course suggestions, assessed through metrics such as Hit Rate, semantic similarity, and coherence. Our experimental analysis highlights the nuanced differences between the generated and existing materials, underscoring the model's capacity to offer more engaging and accessible learning experiences. This study showcases how LLMs can bridge the gap between core content and supplementary learning, providing students with additional recommended resources while also assisting teachers in designing educational materials.

Title: LLMs as Data Annotators: How Close Are We to Human Performance

Authors: Muhammad Uzair Ul Haq, Davide Rigoni, Alessandro Sperduti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15022
Pdf URL: https://arxiv.org/pdf/2504.15022
Copy Paste: [[2504.15022]] LLMs as Data Annotators: How Close Are We to Human Performance(https://arxiv.org/abs/2504.15022)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: In NLP, fine-tuning LLMs is effective for various applications but requires high-quality annotated data. However, manual annotation of data is labor-intensive, time-consuming, and costly. Therefore, LLMs are increasingly used to automate the process, often employing in-context learning (ICL) in which some examples related to the task are given in the prompt for better performance. However, manually selecting context examples can lead to inefficiencies and suboptimal model performance. This paper presents comprehensive experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task. The evaluation encompasses models with approximately 7B and 70B parameters, including both proprietary and non-proprietary models. Furthermore, leveraging the success of Retrieval-Augmented Generation (RAG), it also considers a method that addresses the limitations of ICL by automatically retrieving contextual examples, thereby enhancing performance. The results highlight the importance of selecting the appropriate LLM and embedding model, understanding the trade-offs between LLM sizes and desired performance, and the necessity to direct research efforts towards more challenging datasets.
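The idea of automatically retrieving in-context examples can be sketched as a nearest-neighbour lookup over an annotated pool. The sketch below is an illustration under stated assumptions, not the paper's pipeline: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `retrieve_examples`, the pool format, and the sample sentences are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    # Hypothetical stand-in for an embedding model: word-count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, pool, k=2):
    # Rank annotated examples by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["text"])), reverse=True)
    return ranked[:k]


pool = [
    {"text": "Alice works at Acme in Paris", "entities": ["Alice", "Acme", "Paris"]},
    {"text": "The weather is nice today", "entities": []},
    {"text": "Bob joined Acme last year", "entities": ["Bob", "Acme"]},
]
for ex in retrieve_examples("Carol works at Acme", pool, k=2):
    print(ex["text"])
```

The retrieved examples would then be placed in the prompt ahead of the query sentence; swapping `embed` for a real embedding model changes the ranking but not the overall flow.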

Title: DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Authors: Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15027
Pdf URL: https://arxiv.org/pdf/2504.15027
Copy Paste: [[2504.15027]] DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models(https://arxiv.org/abs/2504.15027)
Keywords: language model, llm, agent
Abstract: Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs that are more suitable for student LLMs to learn. After standard fine-tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.

Title: RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15047
Pdf URL: https://arxiv.org/pdf/2504.15047
Copy Paste: [[2504.15047]] RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search(https://arxiv.org/abs/2504.15047)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score $\approx 0.84$), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at this https URL, supporting reproducibility and future research in LLM red-teaming.
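The multi-element archive idea can be illustrated with a toy quality-diversity loop. This is a minimal sketch, not the RainbowPlus implementation: `toy_mutate`, `toy_descriptor`, `toy_fitness`, the threshold, and the batch size are all hypothetical stand-ins (a real system would use LLM-based mutation and a judge model). It only shows that each archive cell keeps several prompts rather than a single elite, and that a batch of candidates is scored per iteration.

```python
import random

def toy_mutate(prompt, rng):
    """Hypothetical mutator: appends a style tag (a real system would call an LLM)."""
    return prompt + " " + rng.choice(["[formal]", "[casual]", "[terse]", "[verbose]"])

def toy_descriptor(prompt):
    """Hypothetical behaviour descriptor: last token plus a coarse length bucket."""
    return (prompt.split()[-1], min(len(prompt) // 40, 3))

def toy_fitness(prompt):
    """Hypothetical fitness in (0, 1]: share of distinct tokens in the prompt."""
    tokens = prompt.split()
    return len(set(tokens)) / len(tokens)

def qd_search(seed_prompt, iterations=50, batch=4, threshold=0.5, seed=0):
    rng = random.Random(seed)
    # Each cell maps to a LIST of (fitness, prompt) pairs, not a single elite.
    archive = {toy_descriptor(seed_prompt): [(toy_fitness(seed_prompt), seed_prompt)]}
    for _ in range(iterations):
        cell = rng.choice(sorted(archive))        # pick a populated cell
        _, parent = rng.choice(archive[cell])     # pick any prompt stored in it
        candidates = [toy_mutate(parent, rng) for _ in range(batch)]
        for cand in candidates:                   # score the whole batch
            score = toy_fitness(cand)
            if score >= threshold:                # keep every sufficiently fit candidate
                bucket = archive.setdefault(toy_descriptor(cand), [])
                if (score, cand) not in bucket:
                    bucket.append((score, cand))
    return archive
```

Calling `qd_search("Describe a sunset")` returns a dictionary whose values are lists, which is the structural difference from a classical MAP-Elites archive that stores one elite per cell.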

Title: Testing LLMs' Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT

Authors: Joachim Minder, Guillaume Wisniewski, Natalie Kübler
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2504.15052
Pdf URL: https://arxiv.org/pdf/2504.15052
Copy Paste: [[2504.15052]] Testing LLMs' Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT(https://arxiv.org/abs/2504.15052)
Keywords: language model, gpt, llm, prompt, chat
Abstract: This study investigates the capabilities of large language models (LLMs), specifically ChatGPT, in annotating MT outputs based on an error typology. In contrast to previous work focusing mainly on general language, we explore ChatGPT's ability to identify and categorise errors in specialised translations. By testing two different prompts and based on a customised error typology, we compare ChatGPT annotations with human expert evaluations of translations produced by DeepL and ChatGPT itself. The results show that, for translations generated by DeepL, recall and precision are quite high. However, the degree of accuracy in error categorisation depends on the prompt's specific features and its level of detail, ChatGPT performing very well with a detailed prompt. When evaluating its own translations, ChatGPT achieves significantly poorer results, revealing limitations with self-assessment. These results highlight both the potential and the limitations of LLMs for translation evaluation, particularly in specialised domains. Our experiments pave the way for future research on open-source LLMs, which could produce annotations of comparable or even higher quality. In the future, we also aim to test the practical effectiveness of this automated evaluation in the context of translation training, particularly by optimising the process of human evaluation by teachers and by exploring the impact of annotations by LLMs on students' post-editing and translation learning.

Title: Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models

Authors: K. Wong, B. Wu, S. Bulathwela, M. Cukurova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15093
Pdf URL: https://arxiv.org/pdf/2504.15093
Copy Paste: [[2504.15093]] Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models(https://arxiv.org/abs/2504.15093)
Keywords: language model, llm
Abstract: Detecting collaborative and problem-solving behaviours from digital traces to interpret students' collaborative problem solving (CPS) competency is a long-term goal in the Artificial Intelligence in Education (AIEd) field. Although multimodal data and advanced models are argued to have the potential to detect complex CPS behaviours, empirical evidence on their value remains limited with some contrasting evidence. In this study, we investigated the potential of multimodal data to improve model performance in diagnosing 78 secondary school students' CPS subskills and indicators in authentic educational settings. In particular, text embeddings from verbal data and acoustic embeddings from audio data were used in a multimodal classification model for CPS diagnosis. Both unimodal and multimodal transformer-based models outperformed traditional models in detecting CPS classes. Although the inclusion of multimodality did not improve the performance of traditional unimodal models, its integration into transformer-based models demonstrated improved performance for diagnosing social-cognitive CPS classes compared to unimodal transformer-based models. Based on the results, the paper argues that multimodality and the selection of a particular modelling technique should not be taken for granted to achieve the best performance in the automated detection of every CPS subskill and indicator. Rather, their value is limited to certain types of CPS indicators, affected by the complexity of the labels, and dependent on the composition of indicators in the dataset. We conclude the paper by discussing the required nuance when considering the value of LLMs and multimodality in automated CPS diagnosis, highlighting the need for human-AI complementarity, and proposing the exploration of relevant model architectures and techniques to improve CPS diagnosis in authentic educational contexts.

Title: Kuwain 1.5B: An Arabic SLM via Language Injection

Authors: Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15120
Pdf URL: https://arxiv.org/pdf/2504.15120
Copy Paste: [[2504.15120]] Kuwain 1.5B: An Arabic SLM via Language Injection(https://arxiv.org/abs/2504.15120)
Keywords: language model, llm
Abstract: Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.

Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

Authors: Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15133
Pdf URL: https://arxiv.org/pdf/2504.15133
Copy Paste: [[2504.15133]] EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models(https://arxiv.org/abs/2504.15133)
Keywords: language model, llm
Abstract: In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at this https URL along with a demonstration notebook. In addition, we provide a demo video at this https URL for a quick introduction.
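The generator/applier split can be sketched with plain lists. This is not the EasyEdit2 API: the function names are hypothetical, and the difference-of-means construction is one common way to build a steering vector, assumed here for illustration. The point is that steering adds a vector to a hidden activation at inference time and leaves the model's weights untouched.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def make_steering_vector(positive_acts, negative_acts):
    """'Generator' step (assumed): mean activation on desired examples
    minus mean activation on undesired examples."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def apply_steering(hidden, steering, alpha=1.0):
    """'Applier' step (assumed): shift a hidden state by alpha * steering."""
    return [h + alpha * s for h, s in zip(hidden, steering)]
```

For example, `make_steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])` gives `[1.0, 2.0]`, and applying it to `[0.5, 0.5]` with `alpha=2.0` gives `[2.5, 4.5]`.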

Title: The Synthetic Imputation Approach: Generating Optimal Synthetic Texts For Underrepresented Categories In Supervised Classification Tasks

Authors: Joan C. Timoneda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15160
Pdf URL: https://arxiv.org/pdf/2504.15160
Copy Paste: [[2504.15160]] The Synthetic Imputation Approach: Generating Optimal Synthetic Texts For Underrepresented Categories In Supervised Classification Tasks(https://arxiv.org/abs/2504.15160)
Keywords: language model, gpt, llm, prompt
Abstract: Encoder-decoder Large Language Models (LLMs), such as BERT and RoBERTa, require that all categories in an annotation task be sufficiently represented in the training data for optimal performance. However, it is often difficult to find sufficient examples for all categories in a task when building a high-quality training set. In this article, I describe this problem and propose a solution, the synthetic imputation approach. Leveraging a generative LLM (GPT-4o), this approach generates synthetic texts based on careful prompting and five original examples drawn randomly with replacement from the sample. This approach ensures that new synthetic texts are sufficiently different from the original texts to reduce overfitting, but retain the underlying substantive meaning of the examples to maximize out-of-sample performance. With 75 original examples or more, synthetic imputation's performance is on par with a full sample of original texts, and overfitting remains low, predictable and correctable with 50 original samples. The synthetic imputation approach provides a novel role for generative LLMs in research and allows applied researchers to balance their datasets for best performance.
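The sampling step described above (five original examples drawn randomly with replacement) can be sketched as follows. The prompt wording and the function name `build_imputation_prompt` are hypothetical. The abstract does not give the actual prompt, and the call to a generative model is deliberately left out.

```python
import random

def build_imputation_prompt(examples, category, k=5, seed=None):
    """Draw k examples WITH replacement and wrap them in an illustrative prompt."""
    rng = random.Random(seed)
    drawn = rng.choices(examples, k=k)  # random.choices samples with replacement
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(drawn, 1))
    prompt = (
        f"Here are {k} texts from the category '{category}':\n"
        f"{numbered}\n"
        "Write one new text that keeps the same underlying meaning "
        "but uses different wording."
    )
    return drawn, prompt
```

Because the draw is with replacement, the same original text may appear more than once in a prompt, and repeated calls yield different combinations even for a small pool of originals.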

Title: Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Authors: Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.15205
Pdf URL: https://arxiv.org/pdf/2504.15205
Copy Paste: [[2504.15205]] Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges(https://arxiv.org/abs/2504.15205)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
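The "match perfectly" figures above are exact-agreement rates between two sets of labels. A minimal sketch of that computation follows. The integer encoding of the three-level scale (0, 1, 2) is an assumption for illustration, and the data in the usage note is made up rather than taken from the study.

```python
def exact_agreement(human_labels, llm_labels):
    """Fraction of items on which the two judges assign the same label."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(h == m for h, m in zip(human_labels, llm_labels))
    return matches / len(human_labels)
```

With the made-up labels `[2, 1, 0, 2]` and `[2, 0, 0, 2]`, the function returns `0.75`, since three of the four items agree.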

Title: EvalAgent: Discovering Implicit Evaluation Criteria from the Web

Authors: Manya Wadhwa, Zayne Sprague, Chaitanya Malaviya, Philippe Laban, Junyi Jessy Li, Greg Durrett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15219
Pdf URL: https://arxiv.org/pdf/2504.15219
Copy Paste: [[2504.15219]] EvalAgent: Discovering Implicit Evaluation Criteria from the Web(https://arxiv.org/abs/2504.15219)
Keywords: language model, llm, prompt, agent
Abstract: Evaluation of language model outputs on structured writing tasks is typically conducted with a number of desirable criteria presented to human evaluators or large language models (LLMs). For instance, on a prompt like "Help me draft an academic talk on coffee intake vs research productivity", a model response may be evaluated for criteria like accuracy and coherence. However, high-quality responses should do more than just satisfy basic task requirements. An effective response to this query should include quintessential features of an academic talk, such as a compelling opening, clear research questions, and a takeaway. To help identify these implicit criteria, we introduce EvalAgent, a novel framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent first mines expert-authored online guidance. It then uses this evidence to propose diverse, long-tail evaluation criteria that are grounded in reliable external sources. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit (not directly stated in the user's prompt), yet specific (high degree of lexical precision). Further, EvalAgent criteria are often not satisfied by initial responses but they are actionable, such that responses can be refined to satisfy them. Finally, we show that combining LLM-generated and EvalAgent criteria uncovers more human-valued criteria than using LLMs alone.

Title: Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions

Authors: Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, Deep Ganguli
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15236
Pdf URL: https://arxiv.org/pdf/2504.15236
Copy Paste: [[2504.15236]] Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions(https://arxiv.org/abs/2504.15236)
Keywords: language model
Abstract: AI assistants can impart value judgments that shape people's decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like "moral nihilism". While some values appear consistently across contexts (e.g. "transparency"), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, "harm prevention" emerges when Claude resists users, "historical accuracy" when responding to queries about controversial events, "healthy boundaries" when asked for relationship advice, and "human agency" in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.

Title: MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning

Authors: Yahan Yang, Soham Dan, Shuo Li, Dan Roth, Insup Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.15241
Pdf URL: https://arxiv.org/pdf/2504.15241
Copy Paste: [[2504.15241]] MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning(https://arxiv.org/abs/2504.15241)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data are often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we propose an approach to build a multilingual guardrail with reasoning. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-guided Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail consistently outperforms recent baselines across both in-domain and out-of-domain languages. The multilingual reasoning capability of our guardrail enables it to generate multilingual explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.
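The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group. The snippet below shows only that step. The curriculum and the policy update are not covered, and the use of the population standard deviation and the `eps` term are assumptions made for this illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For rewards `[1.0, 0.0, 1.0, 0.0]`, the advantages are approximately `[1, -1, 1, -1]`. When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.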

Title: Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Authors: Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.15253
Pdf URL: https://arxiv.org/pdf/2504.15253
Copy Paste: [[2504.15253]] Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators(https://arxiv.org/abs/2504.15253)
Keywords: language model, llm
Abstract: Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
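Of the three task settings, response reranking is the simplest to illustrate: the generator produces several candidate responses, an evaluator scores each, and the top-scoring one is returned. The sketch below is not the JETTS harness. `rerank` and the `judge` callable are hypothetical names, and the tie-breaking rule (earliest candidate wins) is an assumption.

```python
def rerank(candidates, judge):
    """Return the candidate with the highest judge score (earliest wins ties)."""
    scored = [(judge(c), index, c) for index, c in enumerate(candidates)]
    best = max(scored, key=lambda item: (item[0], -item[1]))
    return best[2]
```

As a toy check, `rerank(["a", "bbb", "cc"], judge=len)` returns `"bbb"`. In practice the judge would be an LLM prompted to rate or compare responses, or a reward model returning a scalar.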