2025-04-15

Title: Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Authors: Christopher Wolfram, Aaron Schein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08775
Pdf URL: https://arxiv.org/pdf/2504.08775
Copy Paste: [[2504.08775]] Layers at Similar Depths Generate Similar Activations Across LLM Architectures(https://arxiv.org/abs/2504.08775)
Keywords: llm
Abstract: How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not "obvious" either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.
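
A rough illustration of the comparison described in this abstract: the sketch below takes activations for the same inputs from two layers (or two models), finds each input's k nearest neighbors in each space, and reports the average overlap. The function names, the Euclidean distance, and the toy vectors are assumptions made for illustration; the abstract does not state the paper's exact similarity measure.

```python
import math


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in ranked[:k]})
    return neighbors


def mean_knn_overlap(acts_a, acts_b, k):
    """Average fraction of k nearest neighbors shared between two activation sets."""
    nn_a, nn_b = knn_sets(acts_a, k), knn_sets(acts_b, k)
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(nn_a)


# Hypothetical toy activations for the same four inputs.
# The two spaces may have different dimensions; neighbors are computed within each space.
layer_a = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
layer_b = [(1.0, 1.0, 0.0), (1.0, 1.2, 0.0), (9.0, 9.0, 0.0), (9.0, 9.3, 0.0)]
layer_c = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

same_geometry = mean_knn_overlap(layer_a, layer_b, k=1)
different_geometry = mean_knn_overlap(layer_a, layer_c, k=1)
```

Here `layer_a` and `layer_b` induce the same neighbor structure despite different dimensionality, while `layer_c` pairs the inputs differently.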

Title: From Tokens to Lattices: Emergent Lattice Structures in Language Models

Authors: Bo Xiong, Steffen Staab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08778
Pdf URL: https://arxiv.org/pdf/2504.08778
Copy Paste: [[2504.08778]] From Tokens to Lattices: Emergent Lattice Structures in Language Models(https://arxiv.org/abs/2504.08778)
Keywords: language model
Abstract: Pretrained masked language models (MLMs) have demonstrated an impressive capability to comprehend and encode conceptual knowledge, revealing a lattice structure among concepts. This raises a critical question: how does this conceptualization emerge from MLM pretraining? In this paper, we explore this problem from the perspective of Formal Concept Analysis (FCA), a mathematical framework that derives concept lattices from the observations of object-attribute relationships. We show that the MLM's objective implicitly learns a "formal context" that describes objects, attributes, and their dependencies, which enables the reconstruction of a concept lattice through FCA. We propose a novel framework for concept lattice construction from pretrained MLMs and investigate the origin of the inductive biases of MLMs in lattice structure learning. Our framework differs from previous work because it does not rely on human-defined concepts and allows for discovering "latent" concepts that extend beyond human definitions. We create three datasets for evaluation, and the empirical results verify our hypothesis.
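
To make the FCA step in this abstract concrete, here is a minimal sketch that enumerates the formal concepts of a small hand-written formal context. It shows only the textbook FCA construction; the objects, attributes, and function names are hypothetical, and nothing here reproduces how the paper extracts a context from an MLM.

```python
from itertools import combinations


def extent(context, attrs):
    """Objects that have every attribute in attrs."""
    return {obj for obj, has in context.items() if attrs <= has}


def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(all_attrs)
    for obj in objs:
        shared &= context[obj]
    return shared


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every attribute subset.

    Exponential in the number of attributes; fine for toy contexts only.
    """
    all_attrs = set().union(*context.values())
    found = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), size):
            ext = extent(context, set(subset))
            found.add((frozenset(ext), frozenset(intent(context, ext, all_attrs))))
    return found


# Hypothetical formal context: object -> set of attributes.
context = {
    "sparrow": {"flies", "has_wings"},
    "penguin": {"has_wings", "swims"},
    "salmon": {"swims"},
}
concepts = formal_concepts(context)
```

Ordering these concepts by inclusion of their extents yields the concept lattice.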

Title: Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

Authors: Ruoxin Xiong, Yanyu Wang, Suat Gunhan, Yimin Zhu, Charles Berryman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08779
Pdf URL: https://arxiv.org/pdf/2504.08779
Copy Paste: [[2504.08779]] Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams(https://arxiv.org/abs/2504.08779)
Keywords: language model, gpt, llm
Abstract: The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy, subject areas (e.g., construction safety), reasoning complexity (single-step and multi-step), and question formats (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.

Title: Efficient Evaluation of Large Language Models via Collaborative Filtering

Authors: Xu-Xiang Zhong, Chao Yi, Han-Jia Ye
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.08781
Pdf URL: https://arxiv.org/pdf/2504.08781
Copy Paste: [[2504.08781]] Efficient Evaluation of Large Language Models via Collaborative Filtering(https://arxiv.org/abs/2504.08781)
Keywords: language model, llm
Abstract: With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly due to the large number of test instances and their slow inference speed. In this paper, we aim to explore how to efficiently estimate a model's real performance on a given benchmark based on its evaluation results on a small number of instances sampled from the benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users to choose instances that can easily distinguish model performance. In the second stage, we treat performance prediction as a rating prediction problem in RS to predict the target LLM's behavior on unselected instances. Experiments on multiple LLMs and datasets indicate that our method can accurately estimate the target model's performance while largely reducing its inference overhead.
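
The second stage described in this abstract can be pictured with a plain user-based collaborative filter: previously evaluated models act as "users", and the target model's results on unseen instances are predicted from the models it agrees with most. This is a simplified sketch under assumed names and a simple agreement-rate similarity, not the paper's method; the first stage (instance selection) is not modeled.

```python
def estimate_accuracy(history, observed):
    """Estimate a target model's benchmark accuracy from a few observed instances.

    history:  {model_name: [0/1 correctness per instance]} for already-evaluated models.
    observed: {instance_index: 0/1} results of the target model on sampled instances.
    """
    if not observed:
        raise ValueError("need at least one observed instance")
    n_instances = len(next(iter(history.values())))

    # Similarity = rate of agreement with the target on the observed instances.
    sims = {
        name: sum(results[i] == y for i, y in observed.items()) / len(observed)
        for name, results in history.items()
    }
    total = sum(sims.values())

    predictions = []
    for i in range(n_instances):
        if i in observed:
            predictions.append(float(observed[i]))
        elif total > 0:
            # Similarity-weighted average of the other models' results.
            predictions.append(sum(sims[m] * history[m][i] for m in history) / total)
        else:
            # No model agrees with the target at all: fall back to a plain average.
            predictions.append(sum(history[m][i] for m in history) / len(history))
    return sum(predictions) / n_instances


# Hypothetical results of two evaluated models on four instances.
history = {"model_a": [1, 1, 0, 0], "model_b": [1, 0, 1, 0]}
# The target model was only run on instances 0 and 1.
estimate = estimate_accuracy(history, {0: 1, 1: 1})
```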

Title: Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Authors: Toqeer Ehsan, Thamar Solorio
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.08792
Pdf URL: https://arxiv.org/pdf/2504.08792
Copy Paste: [[2504.08792]] Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation(https://arxiv.org/abs/2504.08792)
Keywords: language model, llm
Abstract: Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we experiment on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

Title: Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks

Authors: Xiaomei Zhang, Zhaoxi Zhang, Yanjun Zhang, Xufei Zheng, Leo Yu Zhang, Shengshan Hu, Shirui Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08798
Pdf URL: https://arxiv.org/pdf/2504.08798
Copy Paste: [[2504.08798]] Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks(https://arxiv.org/abs/2504.08798)
Keywords: language model
Abstract: Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance.

Title: Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models

Authors: Zhengke Sun, Hangwei Qian, Ivor Tsang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08808
Pdf URL: https://arxiv.org/pdf/2504.08808
Copy Paste: [[2504.08808]] Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models(https://arxiv.org/abs/2504.08808)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at this https URL.

Title: CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization

Authors: Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08820
Pdf URL: https://arxiv.org/pdf/2504.08820
Copy Paste: [[2504.08820]] CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization(https://arxiv.org/abs/2504.08820)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user experience and mitigating cultural conflicts. Existing approaches develop culturally aligned LLMs primarily through fine-tuning with massive carefully curated culture-specific corpora. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture's core characteristics with redundancy, causing computation waste; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modeling. To handle these challenges, we introduce CAReDiO, a novel cultural data construction framework. Specifically, CAReDiO utilizes powerful LLMs to automatically generate cultural conversation data, where both the queries and responses are further optimized by maximizing representativeness and distinctiveness. Using CAReDiO, we construct a small yet effective dataset, covering five cultures, and compare it with several recent cultural corpora. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, enhancing both performance and efficiency.

Title: SD$^2$: Self-Distilled Sparse Drafters

Authors: Mike Lasby, Nish Sinnadurai, Valavan Manohararajah, Sean Lie, Vithursan Thangarasa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08838
Pdf URL: https://arxiv.org/pdf/2504.08838
Copy Paste: [[2504.08838]] SD$^2$: Self-Distilled Sparse Drafters(https://arxiv.org/abs/2504.08838)
Keywords: language model, llm
Abstract: Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a 1.59$\times$ higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with an 8.36% reduction in MAL compared to a dense draft model. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.
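
For readers unfamiliar with the acceptance metric in this abstract, the sketch below shows one simple way to measure accepted length under greedy verification: a drafted token is accepted while it matches the token the target model would have produced at that position. The function names and token values are hypothetical, and the paper's exact MAL definition (for example, whether a bonus token from the target is counted) is not given in the abstract.

```python
def accepted_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target's tokens (greedy verification)."""
    count = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        count += 1
    return count


def mean_accepted_length(rounds):
    """Average accepted length over (draft_tokens, target_tokens) pairs."""
    if not rounds:
        raise ValueError("need at least one speculation round")
    return sum(accepted_length(d, t) for d, t in rounds) / len(rounds)


# Hypothetical token ids from three speculation rounds.
rounds = [
    ([1, 2, 3], [1, 2, 4]),  # first two tokens accepted
    ([5, 6], [5, 6]),        # both accepted
    ([7], [8]),              # rejected immediately
]
mal = mean_accepted_length(rounds)
```

A better-aligned draft model raises this average, so each expensive target-model pass confirms more tokens.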

Title: Forecasting Communication Derailments Through Conversation Generation

Authors: Yunfan Zhang, Kathleen McKeown, Smaranda Muresan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08905
Pdf URL: https://arxiv.org/pdf/2504.08905
Copy Paste: [[2504.08905]] Forecasting Communication Derailments Through Conversation Generation(https://arxiv.org/abs/2504.08905)
Keywords: language model, llm
Abstract: Forecasting communication derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models' success at identifying offensive speech present in conversations, they struggle to forecast future communication derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the communication outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English communication derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.
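
The consensus step in this abstract can be sketched as a vote over sampled continuations. In the sketch below, the sampled trajectories are hard-coded and the derailment check is a hypothetical keyword test standing in for whatever classifier is actually used; the 0.5 threshold is likewise an assumption.

```python
def forecast_derailment(trajectories, is_derailed, threshold=0.5):
    """Predict derailment if more than `threshold` of sampled futures derail.

    Returns (prediction, share_of_derailed_trajectories).
    """
    if not trajectories:
        raise ValueError("need at least one sampled trajectory")
    votes = [bool(is_derailed(t)) for t in trajectories]
    share = sum(votes) / len(votes)
    return share > threshold, share


def contains_insult(trajectory):
    """Hypothetical stand-in for a derailment classifier: a crude keyword check."""
    return any("idiot" in turn.lower() for turn in trajectory)


# Hypothetical futures sampled for one conversation history.
sampled = [
    ["I disagree.", "You are an idiot."],
    ["I disagree.", "Fair point."],
    ["I disagree.", "Only an idiot says that."],
]
prediction, share = forecast_derailment(sampled, contains_insult)
```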

Title: Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

Authors: Mehmet Arif Demirtaş, Claire Zheng, Max Fowler, Kathryn Cunningham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08958
Pdf URL: https://arxiv.org/pdf/2504.08958
Copy Paste: [[2504.08958]] Generating Planning Feedback for Open-Ended Programming Exercises with LLMs(https://arxiv.org/abs/2504.08958)
Keywords: language model, gpt, llm
Abstract: To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLMs) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e., programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.

Title: A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models

Authors: Kseniia Petukhova, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08961
Pdf URL: https://arxiv.org/pdf/2504.08961
Copy Paste: [[2504.08961]] A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models(https://arxiv.org/abs/2504.08961)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have shown promise in automating discourse annotation for conversations. While manually designing tree annotation schemes significantly improves annotation quality for humans and models, their creation remains time-consuming and requires expert knowledge. We propose a fully automated pipeline that uses LLMs to construct such schemes and perform annotation. We evaluate our approach on speech functions (SFs) and the Switchboard-DAMSL (SWBD-DAMSL) taxonomies. Our experiments compare various design choices, and we show that frequency-guided decision trees, paired with an advanced LLM for annotation, can outperform previously manually designed trees and even match or surpass human annotators while significantly reducing the time required for annotation. We release all code and resultant schemes and annotations to facilitate future research on discourse annotation.

Title: From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy

Authors: Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09049
Pdf URL: https://arxiv.org/pdf/2504.09049
Copy Paste: [[2504.09049]] From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy(https://arxiv.org/abs/2504.09049)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlate with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs across various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans, who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at this https URL.
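
Of the three scoring methods named in this abstract, fuzzy string matching is the easiest to illustrate. The sketch below scores extracted quotes against reference punchlines with Python's standard `difflib.SequenceMatcher`; the aggregation (best match per reference, averaged) and the example quotes are assumptions, not the paper's metric.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_recall(predicted, references):
    """For each reference punchline, take its best-matching prediction; average the scores."""
    if not references:
        raise ValueError("need at least one reference quote")
    if not predicted:
        return 0.0
    best = [max(fuzzy_ratio(p, r) for p in predicted) for r in references]
    return sum(best) / len(best)


# Hypothetical reference punchline and model extractions.
references = ["I told my wife she was drawing her eyebrows too high"]
exact = fuzzy_recall(["i told my wife she was drawing her eyebrows too high"], references)
partial = fuzzy_recall(["she was drawing her eyebrows too high"], references)
```

Fuzzy matching tolerates the small wording differences that arise when a model paraphrases or truncates a quote, which exact string comparison would score as a miss.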

Title: Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models

Authors: Matt Grenander, Siddharth Varia, Paula Czarnowska, Yogarshi Vyas, Kishaloy Halder, Bonan Min
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09071
Pdf URL: https://arxiv.org/pdf/2504.09071
Copy Paste: [[2504.09071]] Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models(https://arxiv.org/abs/2504.09071)
Keywords: language model, hallucination
Abstract: Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long-document narrative tasks. Narrative texts' length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are equally likely to contain hallucinations compared to summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale to plan-guided approaches to summarization, especially for long, complex domains such as narrative texts.

Title: A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents

Authors: Akanksha Mehndiratta, Krishna Asawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09073
Pdf URL: https://arxiv.org/pdf/2504.09073
Copy Paste: [[2504.09073]] A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents(https://arxiv.org/abs/2504.09073)
Keywords: agent
Abstract: Multiturn dialogue models aim to generate human-like responses by leveraging conversational context, consisting of utterances from previous exchanges. Existing methods often neglect the interactions between these utterances or treat all of them as equally significant. This paper introduces a discourse-aware framework for response selection in retrieval-based dialogue systems. The proposed model first encodes each utterance and response with contextual, positional, and syntactic features using Multi-view Canonical Correlation Analysis (MCCA). It then learns discourse tokens that capture relationships between an utterance and its surrounding turns in a shared subspace via Canonical Correlation Analysis (CCA). This two-step approach effectively integrates semantic and syntactic features to build discourse-level understanding. Experiments on the Ubuntu Dialogue Corpus demonstrate that our model achieves significant improvements in automatic evaluation metrics, highlighting its effectiveness in response selection.

Title: Enhancing Dialogue Systems with Discourse-Level Understanding Using Deep Canonical Correlation Analysis

Authors: Akanksha Mehndiratta, Krishna Asawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09094
Pdf URL: https://arxiv.org/pdf/2504.09094
Copy Paste: [[2504.09094]] Enhancing Dialogue Systems with Discourse-Level Understanding Using Deep Canonical Correlation Analysis(https://arxiv.org/abs/2504.09094)
Keywords: agent
Abstract: The evolution of conversational agents has been driven by the need for more contextually aware systems that can effectively manage dialogue over extended interactions. To address the limitations of existing models in capturing and utilizing long-term conversational history, we propose a novel framework that integrates Deep Canonical Correlation Analysis (DCCA) for discourse-level understanding. This framework learns discourse tokens to capture relationships between utterances and their surrounding context, enabling a better understanding of long-term dependencies. Experiments on the Ubuntu Dialogue Corpus demonstrate significant enhancement in response selection, based on the improved automatic evaluation metric scores. The results highlight the potential of DCCA in improving dialogue systems by allowing them to filter out irrelevant context and retain critical discourse information for more accurate response retrieval.

Title: VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Authors: Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09130
Pdf URL: https://arxiv.org/pdf/2504.09130
Copy Paste: [[2504.09130]] VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search(https://arxiv.org/abs/2504.09130)
Keywords: language model
Abstract: Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
摘要：大型视觉模型的最新进展展示了出色的功能。但是，当人类通常通过视觉辅助和故意，逐步思考来解决的复杂推理任务时，他们通常会步履蹒跚。尽管现有方法探索了基于文本的慢速思维或基本的视觉援助，但它们却没有捕获人类视觉语言推理过程的复杂，交错的性质。为了克服这些局限性并受到人类认知缓慢思考的机制的启发，我们引入了Visuothink，这是一个新颖的框架，无缝地整合了视觉空间和语言领域。 Visuothink通过实现渐进的视觉文本推理来促进多模式的慢速思考，并通过搜索搜索结合测试时间缩放。广泛的实验表明，Visuothink通过推理时间缩放显着增强了推理能力，即使没有微调，在涉及几何形状和空间推理的任务中实现了最新性能。

Title: Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

Authors: Haotian Ye, Himanshu Jain, Chong You, Ananda Theertha Suresh, Haowei Lin, James Zou, Felix Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09135
Pdf URL: https://arxiv.org/pdf/2504.09135
Copy Paste: [[2504.09135]] Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models(https://arxiv.org/abs/2504.09135)
Keywords: language model
Abstract: In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.
摘要：在大型语言模型的实际应用中，通常需要限制输出：从预定义的产品或文档集中选择项目，生成符合安全标准的短语，或符合专门的格式化样式。为了控制这一代，受限的解码已被广泛采用。但是，在基于GPU的模型推理范例下，现有的基于前缀的约束解码效率低下，并且它将意想不到的偏见引入输出分布中。本文介绍了具有基于GPU的平行前缀验证（PPV）的约束解码（DIS）的动态重要性采样，这是一种新型算法，利用动态重要性采样来实现理论上保证的渐近无偏见，并克服了前free的效率低下。广泛的实验表明，在效率和产出质量方面，我们方法比现有方法的优越性。这些结果突出了我们方法在遵守特定约束至关重要的应用中改善受约束生成的潜力。

Title: Can postgraduate translation students identify machine-generated text?

Authors: Michael Farrell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09164
Pdf URL: https://arxiv.org/pdf/2504.09164
Copy Paste: [[2504.09164]] Can postgraduate translation students identify machine-generated text?(https://arxiv.org/abs/2504.09164)
Keywords: gpt, chat
Abstract: Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing both machine and traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies typically found in synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated (ChatGPT-4o). The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified the same textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.
摘要：鉴于生成人工智能的使用日益增长，作为创建多语言内容并绕过机器和传统翻译方法的工具，这项研究探讨了经过语言训练的个人识别从人类写的文本（HT）识别机器生成的输出的能力。经过简短的培训课程，介绍了通常在合成文本（ST）中发现的文本异常之后，二十三个研究生翻译的学生分析了意大利散文的摘录，并分配了可能性分数，以表明他们相信他们是人写的还是人为撰写的或AI生成的（CHETGPT-4O）。结果表明，平均而言，学生努力区分HT和ST，只有两个参与者达到了明显的准确性。仔细分析表明，学生经常在HT和ST中识别出相同的文本异常，尽管诸如低爆发和自相矛盾之类的特征与ST更频繁地相关。这些发现表明需要改进准备培训。此外，该研究提出了有关编辑合成文本的必要性以使其听起来更像人性化的问题，并建议进一步研究以确定AI生成的文本是否已经足够自然，不需要进一步的细化。

Title: Langformers: Unified NLP Pipelines for Language Models

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09170
Pdf URL: https://arxiv.org/pdf/2504.09170
Copy Paste: [[2504.09170]] Langformers: Unified NLP Pipelines for Language Models(https://arxiv.org/abs/2504.09170)
Keywords: language model, llm, agent
Abstract: Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: this https URL
摘要：基于变压器的语言模型已彻底改变了自然语言处理（NLP）的领域。但是，使用这些模型通常涉及导航多个框架和工具，以及编写重复的样板代码。这种复杂性可以阻止非编程者和初学者，甚至减慢经验丰富的开发人员的原型制作。为了应对这些挑战，我们介绍了Langformers，这是一个开源Python库，旨在通过大型语言模型（LLM）的统一，基于工厂的界面简化NLP管道和蒙版语言模型（MLM）任务。 Langformers将对话式AI，MLM预处理，文本分类，句子嵌入/重新播放，数据标记，语义搜索和知识蒸馏融合到凝聚力的API中，支持拥抱面孔和Ollama等流行的平台。关键创新包括：（1）抽象培训，推理和部署复杂性的特定于任务的工厂；（2）对话剂的内置记忆和流式传输；（3）轻巧的模块化设计优先考虑易用性。文档：此HTTPS URL

Title: Parameterized Synthetic Text Generation with SimpleStories

Authors: Lennart Finke, Thomas Dooms, Mat Allen, Juan Diego Rodriguez, Noa Nabeshima, Dan Braun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09184
Pdf URL: https://arxiv.org/pdf/2504.09184
Copy Paste: [[2504.09184]] Parameterized Synthetic Text Generation with SimpleStories(https://arxiv.org/abs/2504.09184)
Keywords: prompt
Abstract: We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
摘要：我们介绍了简单的综合故事数据集，简单语言，每个用英语和日语组成200万个故事。我们的方法采用了具有多个抽象的特征的提示的参数化，从而可以系统地控制故事特征，以确保广泛的句法和语义多样性。我们的方法基于微小的数据集中的局限性，表明可以在大规模的合成文本生成中同时实现简单性和多样性。

Title: Feature-Aware Malicious Output Detection and Mitigation

Authors: Weilong Dong, Peiguang Li, Yu Tian, Xinyi Zeng, Fengdi Li, Sirui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09191
Pdf URL: https://arxiv.org/pdf/2504.09191
Copy Paste: [[2504.09191]] Feature-Aware Malicious Output Detection and Mitigation(https://arxiv.org/abs/2504.09191)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, limiting their defense against jailbreak. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. By employing a simple discriminator, we detect potential malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. By employing activation patching, an additional rejection vector is incorporated during the subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models' standard generation capabilities.
摘要：大型语言模型（LLM）的快速发展为各个领域带来了重大好处，同时引入了重大风险。尽管通过强化学习进行了微调，但LLMS缺乏辨别恶意内容的能力，从而限制了他们防止越狱的辩护。为了解决这些安全问题，我们提出了一种有害响应拒绝（FMM）的特征感知方法，该方法检测到模型的特征空间中存在恶意特征，并自适应地调整了模型的拒绝机制。通过采用简单的歧视者，我们在解码阶段检测到潜在的恶意性状。在检测指示有毒令牌的特征后，FMM会再生当前令牌。通过采用激活补丁，在随后的令牌生成过程中掺入了另一个拒绝向量，将模型转向拒绝响应。实验结果证明了我们在多种语言模型和各种攻击技术中的方法的有效性，同时至关重要地维护了模型的标准生成能力。

Title: Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation

Authors: Owen Patterson, Chee Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09305
Pdf URL: https://arxiv.org/pdf/2504.09305
Copy Paste: [[2504.09305]] Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation(https://arxiv.org/abs/2504.09305)
Keywords: language model
Abstract: In-Context Learning (ICL) empowers large language models to perform tasks by conditioning on a few input-output examples. However, the performance of ICL is highly sensitive to the selection of these demonstrations. While existing methods focus on similarity or contrastive selection, they often overlook the importance of diversity among the chosen examples. In this paper, we propose DiverseConE (Diversity-Enhanced Contrastive Example Selection), a novel approach for demonstration selection in in-context learning for machine translation. Our method builds upon contrastive selection by incorporating a diversity enhancement step based on embedding space dissimilarity. We conduct extensive experiments on the Llama2-7b model across four language pairs (English-Chinese, Chinese-English, Russian-German, German-Russian) in 1-shot and 3-shot settings, using COMET20 and COMET22 for evaluation. Our results demonstrate that DiverseConE consistently outperforms strong baseline methods, including random selection, BM25, TopK, and a state-of-the-art contrastive selection method. Further analysis, including diversity metrics and human evaluation, validates the effectiveness of our approach and highlights the benefits of considering demonstration diversity for improved translation quality.
摘要：内部文化学习（ICL）通过在一些输入输出示例进行调节来赋予大型语言模型来执行任务。但是，ICL的性能对这些示范的选择高度敏感。尽管现有方法着眼于相似性或对比选择，但它们经常忽略所选示例中多样性的重要性。在本文中，我们提出了Diversecone（多样性增强的对比示例选择），这是一种在机器翻译中的秘密学习中进行演示选择的新方法。我们的方法是基于对比度选择的，它通过基于嵌入空间差异的多样性增强步骤结合。我们使用comet20和comet22进行评估，在1次和3-Shot设置中，在四种语言对（英语 - 英语，俄罗斯 - 德国，德国俄罗斯，德国俄罗斯）上对Llama2-7b模型进行了广泛的实验。我们的结果表明，多样性始终优于强大的基线方法，包括随机选择，BM25，TOPK和最先进的对比选择方法。包括多样性指标和人类评估在内的进一步分析验证了我们方法的有效性，并突出了考虑演示多样性以改善翻译质量的好处。
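
Illustrative sketch (not from the paper): the abstract describes adding a diversity step, based on embedding-space dissimilarity, on top of an existing demonstration selector. The snippet below shows one generic way such a step could look, using a maximal-marginal-relevance-style greedy rule as a stand-in. The function names, the `lam` trade-off weight, and the use of cosine similarity are all assumptions; this is not the DiverseConE procedure itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_diverse(query, candidates, k, lam=0.5):
    """Greedily pick k candidate indices (hypothetical helper).

    Each step maximizes lam * relevance - (1 - lam) * redundancy, where
    relevance is similarity to the query and redundancy is the highest
    similarity to any demonstration already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy 2-D "embeddings": candidates 0 and 1 are near-duplicates, 2 points elsewhere.
query = [1.0, 0.0]
pool = [[1.0, 0.1], [1.0, 0.2], [1.0, -1.0]]
picked = select_diverse(query, pool, k=2)            # diversity-aware
picked_plain = select_diverse(query, pool, k=2, lam=1.0)  # pure similarity
```

With `lam=1.0` the rule reduces to plain top-k similarity, so the two near-duplicate candidates are chosen; lowering `lam` trades some relevance for coverage of the embedding space.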

Title: Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts

Authors: Emily Johnson, Xavier Holt, Noah Wilson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09309
Pdf URL: https://arxiv.org/pdf/2504.09309
Copy Paste: [[2504.09309]] Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts(https://arxiv.org/abs/2504.09309)
Keywords: language model, llm, prompt
Abstract: Legal multi-label classification is a critical task for organizing and accessing the vast amount of legal documentation. Despite its importance, it faces challenges such as the complexity of legal language, intricate label dependencies, and significant label imbalance. In this paper, we propose Legal-LLM, a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning. We reframe the multi-label classification task as a structured generation problem, instructing the LLM to directly output the relevant legal categories for a given document. We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores. Our experimental results demonstrate that Legal-LLM outperforms a range of strong baseline models, including traditional methods and other Transformer-based approaches. Furthermore, ablation studies and human evaluations validate the effectiveness of our approach, particularly in handling label imbalance and generating relevant and accurate legal labels.
摘要：法律多标签分类是组织和访问大量法律文件的关键任务。尽管它很重要，但它仍面临挑战，例如法律语言的复杂性，错综复杂的标签依赖性和重大标签失衡。在本文中，我们提出了一种法律限制，这是一种新颖的方法，该方法通过微调来利用大语言模型（LLMS）的指导跟踪功能。我们将多标签分类任务重新构架为结构化生成问题，指示LLM直接输出给定文档的相关法律类别。我们使用Micro-F1和Macro-F1分数在两个基准数据集（Posture50k和Eurolex57K）上评估我们的方法。我们的实验结果表明，Legal-LLM的表现优于一系列强大的基线模型，包括传统方法和其他基于变压器的方法。此外，消融研究和人类评估验证了我们方法的有效性，尤其是在处理标签不平衡并产生相关和准确的法律标签方面。
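
Illustrative sketch (not from the paper): the abstract reports micro-F1 and macro-F1 for multi-label classification. The snippet below shows how these two averages are commonly computed from per-label true positives, false positives, and false negatives. The helper names and the toy labels are assumptions, and treating a label with no predictions and no gold instances as F1 = 0 is one convention among several.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, labels):
    """Return (micro_f1, macro_f1) for lists of gold and predicted label sets."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
    return micro, macro

# Toy example with three hypothetical labels and three documents.
labels = ["a", "b", "c"]
gold = [{"a", "b"}, {"a"}, {"c"}]
pred = [{"a"}, {"a", "b"}, {"a"}]
micro, macro = micro_macro_f1(gold, pred, labels)
```

Because macro-F1 weights every label equally, rare labels that are never predicted correctly pull it down much more than they affect micro-F1, which is why the two are usually reported together under label imbalance.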

Title: QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Authors: Ramya Namuduri, Yating Wu, Anshun Asher Zheng, Manya Wadhwa, Greg Durrett, Junyi Jessy Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09373
Pdf URL: https://arxiv.org/pdf/2504.09373
Copy Paste: [[2504.09373]] QUDsim: Quantifying Discourse Similarities in LLM-Generated Text(https://arxiv.org/abs/2504.09373)
Keywords: language model, llm
Abstract: As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture content overlap, thus making them unsuitable for detecting structural similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build QUDsim, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.
摘要：随着大型语言模型在各种写作任务上变得越来越有能力，它们产生独特和创造性内容的弱点将成为一个重大责任。尽管LLM具有生成涵盖各种主题的文本的能力，但我们旨在通过相似性指标正式化和量化的整体重复性感。文档之间的熟悉程度来自潜在的话语结构的持续存在。但是，现有的相似性指标取决于词汇叠加和句法模式很大程度上捕获了$ \ textit {content} $重叠，从而使它们不适合检测$ \ textit {结构} $相似性。我们在讨论中的问题（QUD）中介绍了基于语言理论的抽象，并质疑语义，以帮助量化话语进展的差异。然后，我们使用此框架来构建$ \ textbf {qudsim} $，这是一个可以检测文档之间的话语相似之处的相似度量。使用Qudsim，我们发现LLM经常在样本中重用话语结构（比人类更多），即使内容有所不同。此外，LLM不仅重复且结构均匀，而且在他们使用的结构类型中与人类作者有所不同。

Title: Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs

Authors: Kartik Ravisankar, Hyojung Han, Marine Carpuat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09378
Pdf URL: https://arxiv.org/pdf/2504.09378
Copy Paste: [[2504.09378]] Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs(https://arxiv.org/abs/2504.09378)
Keywords: language model, llm
Abstract: Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities, yet the mechanisms driving cross-lingual generalization remain poorly understood. This work investigates how the alignment of representations for text written in different languages correlates with LLM performance on natural language understanding tasks and translation tasks, both at the language and the instance level. For this purpose, we introduce cross-lingual alignment metrics such as the Discriminative Alignment Index (DALI) to quantify the alignment at an instance level for discriminative tasks. Through experiments on three natural language understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions, exposing alignment as a necessary but insufficient condition for success.
摘要：大型语言模型（LLMS）主要在英语文本上进行了预先培训，具有令人惊讶的多语言能力，但是驱动跨语性概括的机制仍然很少理解。这项工作调查了用不同语言编写的文本的表示形式与LLM在语言和实例级别上的自然语言理解任务和翻译任务的绩效相关。为此，我们介绍了跨语言对准指标，例如判别对准指数（DALI），以在实例级别量化判别任务。通过实验三个自然理解任务（Belebele，XstoryCloze，Xcopa）和机器翻译，我们发现，虽然跨语言的一致性指标与语言水平上的任务准确性密切相关，但样本级别的一致性通常无法区分正确的预测，但可以将不正确的预测区分开，但要取得了不错的要求，以取得必要的状态。

Title: On Language Models' Sensitivity to Suspicious Coincidences

Authors: Sriram Padmanabhan, Kanishka Misra, Kyle Mahowald, Eunsol Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09387
Pdf URL: https://arxiv.org/pdf/2504.09387
Copy Paste: [[2504.09387]] On Language Models' Sensitivity to Suspicious Coincidences(https://arxiv.org/abs/2504.09387)
Keywords: language model, prompt, chain-of-thought
Abstract: Humans are sensitive to suspicious coincidences when generalizing inductively over data, as they make assumptions as to how the data was sampled. This results in smaller, more specific hypotheses being favored over more general ones. For instance, when provided the set {Austin, Dallas, Houston}, one is more likely to think that this is sampled from "Texas Cities" over "US Cities" even though both are compatible. Suspicious coincidence is strongly connected to pragmatic reasoning, and can serve as a testbed to analyze systems on their sensitivity towards the communicative goals of the task (i.e., figuring out the true category underlying the data). In this paper, we analyze whether suspicious coincidence effects are reflected in language models' (LMs) behavior. We do so in the context of two domains: 1) the number game, where humans made judgments of whether a number (e.g., 4) fits a list of given numbers (e.g., 16, 32, 2); and 2) by extending the number game setup to prominent cities. For both domains, the data is compatible with multiple hypotheses and we study which hypothesis is most consistent with the models' behavior. On analyzing five models, we do not find strong evidence for suspicious coincidences in LMs' zero-shot behavior. However, when provided access to the hypotheses space via chain-of-thought or explicit prompting, LMs start to show an effect resembling suspicious coincidences, sometimes even showing effects consistent with humans. Our study suggests that inductive reasoning behavior in LMs can be enhanced with explicit access to the hypothesis landscape.
摘要：当人类对数据进行概括性概括时，人类对可疑的巧合敏感，因为它们对数据的采样方式做出了假设。这导致较小，更具体的假设比更一般的假设受到青睐。例如，当提供{Austin，Dallas，Houston}的集合时，即使两者都兼容，也更有可能认为这是从“美国城市”上的“德克萨斯城市”中采样的。可疑的巧合与务实的推理密切相关，并且可以用作测试台，以分析系统对任务的沟通目标的敏感性（即找出数据基础的真实类别）。在本文中，我们分析了语言模型（LMS）行为中是否反映了可疑的巧合效应。我们在两个域的上下文中这样做：1）数字游戏，人类对数字（例如4）是否适合给定数字的列表（例如16、32、2）做出判断； 2）将数字游戏设置扩展到著名城市。对于两个领域，数据都与多个假设兼容，我们研究哪些假设与模型的行为最一致。在分析五个模型时，我们找不到有力的证据表明LMS零射击行为中可疑的巧合。但是，当通过思考链或明确提示提供对假设空间的访问时，LMS开始显示出类似于可疑巧合的效果，有时甚至显示出与人类一致的效果。我们的研究表明，通过明确访问假设景观，可以增强LMS中的归纳推理行为。

Title: Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Authors: Vishakh Padmakumar, Chen Yueh-Han, Jane Pan, Valerie Chen, He He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09389
Pdf URL: https://arxiv.org/pdf/2504.09389
Copy Paste: [[2504.09389]] Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models(https://arxiv.org/abs/2504.09389)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. In contrast, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM-generated text is less novel than human-written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveals a trade-off between originality and quality. While these methods can boost novelty, they do so by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.
摘要：由于大型语言模型（LLM）越来越多地用于构想和科学发现，因此评估其产生新型输出的能力很重要。先前的工作将新颖性评估为训练数据的原创性，但原始产出可能是低质量。相反，非专家法官可能有利于高质量但记忆的产出，从而限制了人类偏爱作为度量的可靠性。我们为LLM世代提出了一个新的新颖性指标，可以平衡原创性和质量 - 训练期间看不见的\ ngram的谐波平均值和特定于任务的质量得分。我们评估了两个开放式模型（Olmo和Pythia）的几代人的新颖性，这些都是三个创意任务：故事完成，诗歌写作和创造性的工具使用。我们发现LLM生成的文本不如人类书面文本新颖。为了引起更多新颖的产出，我们尝试了各种推理时间方法，这揭示了创意和质量之间的权衡。尽管这些方法可以提高新颖性，但它们是通过以质量为代价来提高原创性的。相比之下，增加模型大小或应用后训练可靠地改变了帕累托的边界，强调从更强大的基本模型开始是提高新颖性的更有效方法。
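
Illustrative sketch (not from the paper): the abstract defines novelty as the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. The snippet below works through that arithmetic on a toy example. Whitespace tokenization, the choice of n = 2, the quality value of 0.8, and all function names are assumptions made only for illustration; the paper's actual n-gram matching and quality scoring are not specified in the abstract.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unseen_fraction(tokens, seen, n):
    """Fraction of the generation's n-grams that are absent from the `seen` set."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(g not in seen for g in grams) / len(grams)

def novelty(originality, quality):
    """Harmonic mean of originality and quality, both assumed to lie in [0, 1]."""
    total = originality + quality
    return 2 * originality * quality / total if total else 0.0

# Toy "training corpus" and generation (hypothetical data).
train_tokens = "the cat sat on the mat".split()
seen_bigrams = set(ngrams(train_tokens, 2))
generation = "the cat sat on a rug".split()

originality = unseen_fraction(generation, seen_bigrams, 2)  # 2 of 5 bigrams are unseen
score = novelty(originality, quality=0.8)                   # 0.8 is an assumed quality score
```

The harmonic mean is dominated by the smaller of its two inputs, so an output that is highly original but low quality (or high quality but memorized) receives a low novelty score.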

Title: Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification

Authors: Joseph Liu, Yoonsoo Nam, Xinyue Cui, Swabha Swayamdipta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09394
Pdf URL: https://arxiv.org/pdf/2504.09394
Copy Paste: [[2504.09394]] Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification(https://arxiv.org/abs/2504.09394)
Keywords: language model, llm
Abstract: Despite the successes of language models, their evaluation remains a daunting challenge for new and existing tasks. We consider the task of text simplification, commonly used to improve information accessibility, where evaluation faces two major challenges. First, the data in existing benchmarks might not reflect the capabilities of current language models on the task, often containing disfluent, incoherent, or simplistic examples. Second, existing human ratings associated with the benchmarks often contain a high degree of disagreement, resulting in inconsistent ratings; nevertheless, existing metrics still have to show higher correlations with these imperfect ratings. As a result, evaluation for the task is not reliable and does not reflect expected trends (e.g., more powerful models being assigned higher scores). We address these challenges for the task of text simplification through three contributions. First, we introduce SynthSimpliEval, a synthetic benchmark for text simplification featuring simplified sentences generated by models of varying sizes. Through a pilot study, we show that human ratings on our benchmark exhibit high inter-annotator agreement and reflect the expected trend: larger models produce higher-quality simplifications. Second, we show that auto-evaluation with a panel of LLM judges (LLMs-as-a-jury) often suffices to obtain consistent ratings for the evaluation of text simplification. Third, we demonstrate that existing learnable metrics for text simplification benefit from training on our LLMs-as-a-jury-rated synthetic data, closing the gap with pure LLMs-as-a-jury for evaluation. Overall, through our case study on text simplification, we show that a reliable evaluation requires higher quality test data, which could be obtained through synthetic data and LLMs-as-a-jury ratings.
摘要：尽管语言模型取得了成功，但对新任务和现有任务而言，其评估仍然是一项艰巨的挑战。我们考虑文本简化任务，该任务通常用于改善信息可访问性，其评估面临两个主要挑战。首先，现有基准中的数据可能无法反映当前语言模型在该任务上的能力，通常包含不流畅、不连贯或过于简单的示例。其次，与基准相关的现有人类评级通常存在高度分歧，导致评级不一致；然而，现有指标仍然必须与这些不完美的评级表现出更高的相关性。结果，对该任务的评估并不可靠，也不能反映预期趋势（例如，更强大的模型获得更高的分数）。我们通过三项贡献来应对文本简化任务中的这些挑战。首先，我们介绍了SynthSimpliEval，这是一个文本简化的合成基准，包含由不同规模的模型生成的简化句子。通过一项试点研究，我们表明，我们基准上的人类评分表现出较高的标注者间一致性，并反映了预期趋势：较大的模型会产生更高质量的简化。其次，我们表明，由一组LLM评审（LLMs-as-a-jury）进行的自动评估通常足以为文本简化的评估获得一致的评分。第三，我们证明了现有的可学习文本简化指标受益于在我们经LLMs-as-a-jury评分的合成数据上的训练，从而缩小了与纯LLMs-as-a-jury评估之间的差距。总体而言，通过我们对文本简化的案例研究，我们表明可靠的评估需要更高质量的测试数据，而这可以通过合成数据和LLMs-as-a-jury评分获得。

Title: Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration

Authors: Feijiang Han, Licheng Guo, Hengtao Cui, Zhiyuan Lyu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09402
Pdf URL: https://arxiv.org/pdf/2504.09402
Copy Paste: [[2504.09402]] Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration(https://arxiv.org/abs/2504.09402)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often struggle with tasks that require a deep understanding of complex questions, especially when faced with long-range dependencies or multi-step reasoning. This work investigates the limitations of current LLMs in question comprehension and identifies three insights: (1) repeating question tokens improves comprehension by increasing attention to question regions, (2) increased backward dependencies negatively affect performance due to unidirectional attentional constraints, and (3) recalibrating attentional mechanisms to prioritize question-relevant regions improves performance. Based on these findings, we first propose a family of prompt-based strategies - Step-by-Step Reading (SSR), SSR+, and SSR++ - that guide LLMs to incrementally process question tokens and align their reasoning with the input structure. These methods significantly improve performance, with SSR++ achieving state-of-the-art results on several benchmarks: 96.66% on GSM8K, 94.61% on ASDiv, and 76.28% on AQuA. Second, we introduce a training-free attention recalibration mechanism that dynamically adjusts attention distributions during inference to emphasize question-relevant regions. This method improves the accuracy of LLaMA 3.1-8B on AQuA by 5.17% without changing model parameters or input prompts. Taken together, our results highlight the importance of structured prompt design and attention optimization in improving LLM comprehension, providing lightweight yet effective tools for improving performance in various NLP tasks.
摘要：大型语言模型（LLMS）经常在需要深入了解复杂问题的任务上挣扎，尤其是在面对远程依赖或多步推理时。这项工作调查了有关当前LLM的局限性理解，并确定了三个见解：（1）重复问题令牌通过增加对问题区域的关注来提高理解，（2）增加向后依赖性对单向注意力约束产生的绩效负面影响，以及由于注意力的注意力机制的重新计算，以提高问题 - 优先提高问题 - 提高问题的效果。基于这些发现，我们首先提出了一个基于及时的策略的家族 - 分步阅读（SSR），SSR+和SSR ++ - 指导LLMS逐步处理问题令牌，并将其推理与输入结构保持一致。这些方法显着提高了性能，SSR ++在几个基准上实现了最新的结果：GSM8K的96.66％，ASDIV的94.61％，Aqua的76.28％。其次，我们引入了一种无训练的注意重新校准机制，该机制在推断过程中动态调整注意力分布，以强调与问题相关的区域。此方法将Aqua上的Llama 3.1-8B的精度提高了5.17％，而无需更改模型参数或输入提示。综上所述，我们的结果强调了结构化及时设计和注意力优化在改善LLM理解方面的重要性，从而为改善各种NLP任务的性能提供了轻巧但有效的工具。

Title: UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

Authors: Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.09407
Pdf URL: https://arxiv.org/pdf/2504.09407
Copy Paste: [[2504.09407]] UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents(https://arxiv.org/abs/2504.09407)
Keywords: language model, llm, agent
Abstract: Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate on a web design, but how can they evaluate and iterate on the usability testing study design itself? Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and iterating on their usability testing study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users to interactively test the target website. The system also provides an Agent Interview Interface and a Video Replay Interface so that the UX researchers can easily review and analyze the generated qualitative and quantitative log data. Through a heuristic evaluation, five UX researcher participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.
摘要：可用性测试是一种基本的研究方法，用户体验（UX）研究人员用来评估和迭代网页设计，但是\ textbf {如何评估和迭代可用性测试研究设计}本身？大语言模型模拟代理（\ textbf {llm agent}）的最新进展促使我们设计\ textbf {uxagent}，以支持UX研究人员在进行真实的人类对象研究之前评估和重申其可用性测试研究设计。我们的系统具有角色发电机模块，LLM代理模块和通用浏览器连接器模块，以自动生成数千个模拟用户，以交互性测试目标网站。该系统还提供了代理访谈接口和视频重播接口，以便UX研究人员可以轻松地查看和分析生成的定性和定量日志数据。通过启发式评估，五位UX研究员参与者称赞了我们系统的创新，但也对UX研究中LLM代理使用的未来表示担忧。

Title: SaRO: Enhancing LLM Safety through Reasoning-based Alignment

Authors: Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09420
Pdf URL: https://arxiv.org/pdf/2504.09420
Copy Paste: [[2504.09420]] SaRO: Enhancing LLM Safety through Reasoning-based Alignment(https://arxiv.org/abs/2504.09420)
Keywords: language model, llm, prompt
Abstract: Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
摘要：当前针对大语模型（LLMS）的安全对准技术面临两个关键挑战：（1）普通化不足，这使模型容易受到新的越狱攻击的影响，以及（2）过度对准，这导致过度拒绝良性指令。我们的初步调查揭示了越狱/有害的查询与嵌入空间的正常提示之间的语义重叠，这表明更有效的安全对齐需要更深入的语义理解。这激发了我们将安全性驱动的推理纳入对齐过程。为此，我们提出了面向安全的推理优化框架（SARO），该框架由两个阶段组成：（1）推理风格的热身（RW），使LLMs可以通过监督的微调通过（2）通过安全的推理过程优化（SRPO）来内化长链推理，以通过直接优化的安全性优先征用（SRPO）。广泛的实验证明了Saro优于传统的对准方法。

Title: ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model

Authors: Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09421
Pdf URL: https://arxiv.org/pdf/2504.09421
Copy Paste: [[2504.09421]] ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model(https://arxiv.org/abs/2504.09421)
Keywords: language model, gpt, llm
Abstract: Recent advances in reasoning with large language models (LLMs) have shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at this https URL.
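Summary: Recent advances in reasoning with large language models (LLMs) have shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 uses diverse training strategies to strengthen diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results show that ClinicalGPT-R1 outperforms GPT-4o on Chinese diagnostic tasks and performs comparably to GPT-4 in English settings. This comparative study validates the strong performance of ClinicalGPT-R1 on disease diagnosis tasks. Resources are available at this https URL.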

Title: HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs

Authors: Sharanya Dasgupta, Sujoy Nath, Arkaprabha Basu, Pourya Shamsolmoali, Swagatam Das
Subjects: cs.CL, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2504.09482
Pdf URL: https://arxiv.org/pdf/2504.09482
Copy Paste: [[2504.09482]] HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs(https://arxiv.org/abs/2504.09482)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at this https URL.
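Summary: Large language models (LLMs) have recently attracted widespread attention for their ability to generate innovative responses to given prompts across many domains. However, LLMs often suffer from the inherent limitation of hallucination, producing incorrect information while keeping responses well structured and coherent. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of a response and eventually shift toward misinformation. This phenomenon resembles human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty in minor segments of their speech. To investigate further, we introduce HalluShift, an approach designed to analyze distribution shifts in the internal state space and token probabilities of LLM-generated responses. Our method outperforms existing baselines across various benchmark datasets. Our codebase is available at this https URL.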

Title: Kongzi: A Historical Large Language Model with Fact Enhancement

Authors: Jiashu Yang, Ningning Wang, Yian Zhao, Chaoran Feng, Junjia Du, Hao Pang, Zhirui Fang, Xuxin Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09488
Pdf URL: https://arxiv.org/pdf/2504.09488
Copy Paste: [[2504.09488]] Kongzi: A Historical Large Language Model with Fact Enhancement(https://arxiv.org/abs/2504.09488)
Keywords: language model, llm
Abstract: The capabilities of the latest large language models (LLMs) have been extended from pure natural language understanding to complex reasoning tasks. However, current reasoning models often exhibit factual inaccuracies in longer reasoning chains, which poses challenges for historical reasoning and limits the potential of LLMs in complex, knowledge-intensive tasks. Historical studies require not only the accurate presentation of factual information but also the ability to establish cross-temporal correlations and derive coherent conclusions from fragmentary and often ambiguous sources. To address these challenges, we propose Kongzi, a large language model specifically designed for historical analysis. Through the integration of curated, high-quality historical data and a novel fact-reinforcement learning strategy, Kongzi demonstrates strong factual alignment and sophisticated reasoning depth. Extensive experiments on tasks such as historical question answering and narrative generation demonstrate that Kongzi outperforms existing models in both factual accuracy and reasoning depth. By effectively addressing the unique challenges inherent in historical texts, Kongzi sets a new standard for the development of accurate and reliable LLMs in professional domains.
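Summary: The capabilities of the latest large language models (LLMs) have extended from pure natural language understanding to complex reasoning tasks. However, current reasoning models often exhibit factual inaccuracies in longer reasoning chains, which poses challenges for historical reasoning and limits the potential of LLMs in complex, knowledge-intensive tasks. Historical studies require not only accurate presentation of factual information but also the ability to establish cross-temporal correlations and to draw coherent conclusions from fragmentary and often ambiguous sources. To address these challenges, we propose Kongzi, a large language model designed specifically for historical analysis. By integrating curated, high-quality historical data with a novel fact-reinforcement learning strategy, Kongzi shows strong factual alignment and sophisticated reasoning depth. Extensive experiments on tasks such as historical question answering and narrative generation show that Kongzi outperforms existing models in both factual accuracy and reasoning depth. By addressing the unique challenges inherent in historical texts, Kongzi sets a new standard for developing accurate and reliable LLMs in professional domains.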

Title: MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs

Authors: Wei Tao, Xiaoyang Qu, Kai Lu, Jiguang Wan, Guokuan Li, Jianzong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09504
Pdf URL: https://arxiv.org/pdf/2504.09504
Copy Paste: [[2504.09504]] MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs(https://arxiv.org/abs/2504.09504)
Keywords: language model, llm
Abstract: When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.
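Summary: When pre-trained large language models (LLMs) are applied to anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method built on pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, the technique combines the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which changes the order in which patches are processed in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which uses contrastive learning so that the model better understands the correlations between different features. Experimental results show that our method outperforms state-of-the-art methods on various public anomaly detection datasets.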

Title: How new data permeates LLM knowledge and how to dilute it

Authors: Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09522
Pdf URL: https://arxiv.org/pdf/2504.09522
Copy Paste: [[2504.09522]] How new data permeates LLM knowledge and how to dilute it(https://arxiv.org/abs/2504.09522)
Keywords: language model, llm, hallucination
Abstract: Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM's existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a "stepping-stone" text augmentation strategy and (2) an "ignore-k" update pruning method. These approaches reduce undesirable priming effects by 50-95% while preserving the model's ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: this https URL
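Summary: Large language models learn, and continue to learn, by accumulating gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We show that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to apply that knowledge inappropriately in unrelated contexts. To study this phenomenon systematically, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates an LLM's existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a "stepping-stone" text augmentation strategy and (2) an "ignore-k" update pruning method. These approaches reduce undesirable priming effects by 50-95% while preserving the model's ability to learn new information. Our findings provide both empirical insight into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: this https URL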

Title: Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution

Authors: Chenghao Li, Chaoning Zhang, Yi Lu, Jiaquan Zhang, Qigan Sun, Xudong Wang, Jiwei Wei, Guoqing Wang, Yang Yang, Heng Tao Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09566
Pdf URL: https://arxiv.org/pdf/2504.09566
Copy Paste: [[2504.09566]] Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution(https://arxiv.org/abs/2504.09566)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT), a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of "Module", "Betti numbers", "Freeness", "Mapping", "Exactness" and "Minimality", enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoT standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at this https URL.
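Summary: Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT), a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem solving. MFR decomposes a module into a sequence of free modules of minimal rank, providing a structured analytical approach to complex systems. The method introduces the concepts of "Module", "Betti numbers", "Freeness", "Mapping", "Exactness" and "Minimality", which allow the original complex problem to be decomposed systematically into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoT standards. In addition, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at this https URL.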

Title: LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

Authors: Biao Fu, Minpeng Liao, Kai Fan, Chengxi Li, Liang Zhang, Yidong Chen, Xiaodong Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09570
Pdf URL: https://arxiv.org/pdf/2504.09570
Copy Paste: [[2504.09570]] LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline(https://arxiv.org/abs/2504.09570)
Keywords: language model, llm, prompt
Abstract: When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt "Translate the following sentence from [src lang] into [tgt lang]:". However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required; in this setting, the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks, and preserves the original abilities of offline translation. Moreover, our approach generalizes well to document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.
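Summary: When the complete source sentence is provided, large language models (LLMs) perform excellently in offline machine translation even with a simple prompt such as "Translate the following sentence from [src lang] into [tgt lang]:". In many real scenarios, however, the source tokens arrive as a stream and simultaneous machine translation (SiMT) is required; in this setting, the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This lets powerful LLMs learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks and preserves the original offline translation ability. Moreover, our approach generalizes well to the document-level SiMT setting without specific fine-tuning, even surpassing the offline translation model.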

Title: Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance

Authors: Zuoli Tang, Junjie Ou, Kaiqin Hu, Chunwei Wu, Zhaoxin Huan, Chilin Fu, Xiaolu Zhang, Jun Zhou, Chenliang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09586
Pdf URL: https://arxiv.org/pdf/2504.09586
Copy Paste: [[2504.09586]] Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance(https://arxiv.org/abs/2504.09586)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent years have witnessed significant progress in large language models' (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, thus raising a significant conflict with CoT reasoning. In this paper, we delve into how LLMs' reasoning performance changes when users provide short-path prompts. The results and analysis reveal that language models can reason effectively and robustly without explicit CoT prompts, while under short-path prompting, LLMs' reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches: an instruction-guided approach and a fine-tuning approach, both designed to effectively manage the conflict. Experimental results show that both methods achieve high accuracy, providing insights into the trade-off between instruction adherence and reasoning accuracy in current models.
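Summary: Recent years have seen significant progress in the reasoning of large language models (LLMs), largely due to chain-of-thought (CoT) approaches, which allow models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, which creates a significant conflict with CoT reasoning. In this paper, we examine how LLMs' reasoning performance changes when users provide short-path prompts. The results and analysis show that language models can reason effectively and robustly without explicit CoT prompts, whereas under short-path prompting their reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches, an instruction-guided approach and a fine-tuning approach, both designed to manage the conflict effectively. Experimental results show that both methods achieve high accuracy, providing insight into the trade-off between instruction adherence and reasoning accuracy in current models.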

Title: Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference

Authors: Yuta Matsui, Ryosuke Yamaki, Ryo Ueda, Seitaro Shinagawa, Tadahiro Taniguchi
Subjects: cs.CL, cs.AI, cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2504.09620
Pdf URL: https://arxiv.org/pdf/2504.09620
Copy Paste: [[2504.09620]] Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference(https://arxiv.org/abs/2504.09620)
Keywords: language model, agent
Abstract: We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs' category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.
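Summary: We propose the Metropolis-Hastings Captioning Game (MHCG), a method that fuses the knowledge of multiple vision-language models (VLMs) by having them learn from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents that alternately caption images and learn from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment shows that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing the VLMs' category-level vocabulary by observing the occurrence of that vocabulary in the generated captions.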

Title: Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability

Authors: Haotian Wang, Han Zhao, Shuaiting Chen, Xiaoyu Tian, Sitong Zhao, Yunjie Ji, Yiping Peng, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09639
Pdf URL: https://arxiv.org/pdf/2504.09639
Copy Paste: [[2504.09639]] Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability(https://arxiv.org/abs/2504.09639)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate "thinking" steps to systematically enhance answer quality. In this paper, we propose leveraging these high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
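Summary: Recent advances in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models use deliberate "thinking" steps to systematically improve answer quality. In this paper, we propose leveraging the high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for using the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward supervised fine-tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.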

Title: Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
Subjects: cs.CL, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.09643
Pdf URL: https://arxiv.org/pdf/2504.09643
Copy Paste: [[2504.09643]] Iterative Self-Training for Code Generation via Reinforced Re-Ranking(https://arxiv.org/abs/2504.09643)
Keywords: gpt
Abstract: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.
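Summary: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve overall output quality. One effective way to enhance code generation is to pair a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, which focus on optimizing a generative model with a reward model, our approach emphasizes developing a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset shows that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.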

Title: Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar

Authors: Aung Kyaw Htet, Mark Dras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09645
Pdf URL: https://arxiv.org/pdf/2504.09645
Copy Paste: [[2504.09645]] Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar(https://arxiv.org/abs/2504.09645)
Keywords: language model, llm
Abstract: Despite dramatic recent progress in NLP, it is still a major challenge to apply Large Language Models (LLMs) to low-resource languages. This is made visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates cross-lingual capabilities of NLP systems across a set of 15 languages. In this paper, we extend the XNLI task for one additional low-resource language, Myanmar, as a proxy challenge for broader low-resource languages, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension to the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in the context of community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we carry out evaluations of recent multilingual language models on the myXNLI benchmark, as well as explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar, while uplifting other languages at the same time. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.
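Summary: Despite dramatic recent progress in NLP, applying large language models (LLMs) to low-resource languages remains a major challenge. This is visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates the cross-lingual capabilities of NLP systems across a set of 15 languages. In this paper, we extend the XNLI task to one additional low-resource language, Myanmar, as a proxy challenge for low-resource languages more broadly, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension of the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we evaluate recent multilingual language models on the myXNLI benchmark and explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar while also lifting other languages. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.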

Title: CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering

Authors: Liqiang Wen, Guanming Xiong, Tong Mo, Bing Li, Weiping Li, Wen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09665
Pdf URL: https://arxiv.org/pdf/2504.09665
Copy Paste: [[2504.09665]] CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering(https://arxiv.org/abs/2504.09665)
Keywords: language model, llm, agent
Abstract: This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ datasets demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.
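Summary: This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). Although recent KGQA systems have made significant progress, particularly through the integration of large language models (LLMs), they typically assume that user queries are unambiguous, an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach uses a Bayesian inference mechanism to quantify query ambiguity and to guide LLMs in deciding when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework in which an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ datasets show that our method significantly improves performance by effectively resolving semantic ambiguities. In addition, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.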

Title: Domain-Adaptive Continued Pre-Training of Small Language Models

Authors: Salman Faroz
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09687
Pdf URL: https://arxiv.org/pdf/2504.09687
Copy Paste: [[2504.09687]] Domain-Adaptive Continued Pre-Training of Small Language Models(https://arxiv.org/abs/2504.09687)
Keywords: language model
Abstract: Continued pre-training of small language models offers a promising path for domain adaptation with limited computational resources. I've investigated this approach within educational domains, evaluating it as a resource-efficient alternative to training models from scratch. Using a 125M parameter model, I demonstrate significant performance improvements through incremental training on 400 million tokens, followed by further training to reach 1 billion tokens. My approach includes comprehensive data preprocessing, memory-optimized training configurations, and benchmark-based evaluation. Results show notable gains in knowledge-intensive tasks (MMLU +8.1%) and contextual understanding (HellaSwag +7.6%), while revealing educational domain specialization trade-offs. I analyze token efficiency, catastrophic forgetting mitigation strategies, and scaling patterns. My findings suggest that thoughtful preprocessing and training methodologies enable meaningful improvements in language model capabilities even with constrained computational resources, opening pathways for domain-specific adaptation of smaller language models.
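Summary: Continued pre-training of small language models offers a promising path for domain adaptation with limited computational resources. I investigated this approach within educational domains, evaluating it as a resource-efficient alternative to training models from scratch. Using a 125M parameter model, I demonstrate significant performance improvements through incremental training on 400 million tokens, followed by further training to reach 1 billion tokens. My approach includes comprehensive data preprocessing, memory-optimized training configurations, and benchmark-based evaluation. Results show notable gains in knowledge-intensive tasks (MMLU +8.1%) and contextual understanding (HellaSwag +7.6%), while revealing the trade-offs of specializing in the educational domain. I analyze token efficiency, strategies for mitigating catastrophic forgetting, and scaling patterns. My findings suggest that thoughtful preprocessing and training methodologies enable meaningful improvements in language model capabilities even with constrained computational resources, opening pathways for domain-specific adaptation of smaller language models.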

Title: GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

Authors: Jixiao Zhang, Chunsheng Zuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09696
Pdf URL: https://arxiv.org/pdf/2504.09696
Copy Paste: [[2504.09696]] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models(https://arxiv.org/abs/2504.09696)
Keywords: language model
Abstract: Recent advances in R1-like reasoning models leveraging Group Relative Policy Optimization (GRPO) have significantly improved the performance of language models on mathematical reasoning tasks. However, current GRPO implementations encounter critical challenges, including reward sparsity due to binary accuracy metrics, limited incentives for conciseness, and insufficient focus on complex reasoning tasks. To address these issues, we propose GRPO-LEAD, a suite of novel enhancements tailored for mathematical reasoning. Specifically, GRPO-LEAD introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems. Furthermore, we systematically examine the impact of model scale and supervised fine-tuning (SFT) strategies, demonstrating that larger-scale base models and carefully curated datasets significantly enhance reinforcement learning effectiveness. Extensive empirical evaluations and ablation studies confirm that GRPO-LEAD substantially mitigates previous shortcomings, resulting in language models that produce more concise, accurate, and robust reasoning across diverse mathematical tasks.
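Summary: Recent advances in R1-like reasoning models that use Group Relative Policy Optimization (GRPO) have significantly improved the performance of language models on mathematical reasoning tasks. However, current GRPO implementations face critical challenges, including reward sparsity caused by binary accuracy metrics, limited incentives for conciseness, and insufficient focus on complex reasoning tasks. To address these issues, we propose GRPO-LEAD, a suite of novel enhancements tailored to mathematical reasoning. Specifically, GRPO-LEAD introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems. Furthermore, we systematically examine the impact of model scale and supervised fine-tuning (SFT) strategies, showing that larger base models and carefully curated datasets significantly enhance the effectiveness of reinforcement learning. Extensive empirical evaluations and ablation studies confirm that GRPO-LEAD substantially mitigates the previous shortcomings, yielding language models that produce more concise, accurate, and robust reasoning across diverse mathematical tasks.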

Title: Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish

Authors: Ayşe Aysu Cengiz, Ahmet Kaan Sever, Elif Ecem Ümütlü, Naime Şeyma Erdem, Burak Aytan, Büşra Tufan, Abdullah Topraksoy, Esra Darıcı, Cagri Toraman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09714
Pdf URL: https://arxiv.org/pdf/2504.09714
Copy Paste: [[2504.09714]] Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish(https://arxiv.org/abs/2504.09714)
Keywords: gpt, llm
Abstract: The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. The correctness of the usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common sense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages.

Title: Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Authors: Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Hamza Farooq
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09753
Pdf URL: https://arxiv.org/pdf/2504.09753
Copy Paste: [[2504.09753]] Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance(https://arxiv.org/abs/2504.09753)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM, Mantra-14B, with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K English and Hindi instruction samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.

Title: Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Authors: Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09763
Pdf URL: https://arxiv.org/pdf/2504.09763
Copy Paste: [[2504.09763]] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems(https://arxiv.org/abs/2504.09763)
Keywords: llm
Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for math reasoning as problem generators for stress-testing models. However, prior work has been limited to abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced math problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions an LLM on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. We demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs e.g. finding problem variations that are harder or easier for a learner to solve, as well as data generation.

Title: Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

Authors: Jingtian Wu, Claire Cardie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09781
Pdf URL: https://arxiv.org/pdf/2504.09781
Copy Paste: [[2504.09781]] Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning(https://arxiv.org/abs/2504.09781)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or to synthesize a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.

Title: VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Authors: Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
Subjects: cs.CL, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2504.09795
Pdf URL: https://arxiv.org/pdf/2504.09795
Copy Paste: [[2504.09795]] VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents(https://arxiv.org/abs/2504.09795)
Keywords: language model, retrieval-augmented generation
Abstract: We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

Title: Training Small Reasoning LLMs with Cognitive Preference Alignment

Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09802
Pdf URL: https://arxiv.org/pdf/2504.09802
Copy Paste: [[2504.09802]] Training Small Reasoning LLMs with Cognitive Preference Alignment(https://arxiv.org/abs/2504.09802)
Keywords: language model, llm, chain-of-thought, agent
Abstract: The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can sometimes be ineffective and require a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperform other training methods by a large margin.

Title: Transferable text data distillation by trajectory matching

Authors: Rong Yao, Hailin Hu, Yifei Fu, Hanting Chen, Wenyi Fang, Fanyi Du, Kai Han, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09818
Pdf URL: https://arxiv.org/pdf/2504.09818
Copy Paste: [[2504.09818]] Transferable text data distillation by trajectory matching(https://arxiv.org/abs/2504.09818)
Keywords: language model, llm, prompt
Abstract: In the realm of large language models (LLMs), growing model size brings higher training costs. There is an urgent need to minimize the data size in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full dataset, and it offers better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we propose a method that involves learning pseudo prompt data based on trajectory matching and finding its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, established the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM structures (i.e., OPT to Llama).

Title: Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English

Authors: Michael Kamerath, Aniello De Santo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09886
Pdf URL: https://arxiv.org/pdf/2504.09886
Copy Paste: [[2504.09886]] Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English(https://arxiv.org/abs/2504.09886)
Keywords: llm
Abstract: This paper leverages past sentence processing studies to investigate whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. Furthermore, we test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies interestingly across models, but also general failings of these models in correctly capturing human-like preferences. In light of these results, we argue that RC attachment is the ideal benchmark for cross-linguistic investigations of LLMs' linguistic knowledge and biases.

Title: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Authors: Shuai Zhao, Linchao Zhu, Yi Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09895
Pdf URL: https://arxiv.org/pdf/2504.09895
Copy Paste: [[2504.09895]] Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data(https://arxiv.org/abs/2504.09895)
Keywords: language model, llm
Abstract: Large language models (LLMs) are expected to be helpful, harmless, and honest. In various alignment scenarios, such as general human preference, safety, and confidence alignment, binary preference data collection and reward modeling are resource-intensive but necessary for human preference transferring. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function for LLM alignment. Using similarity as a reward circumvents training reward models, and collecting a single reference answer potentially costs less time than constructing binary preference pairs when multiple candidates are available. Specifically, we develop RefAlign, a versatile REINFORCE-style alignment algorithm, which is free of reference and reward models. Instead, RefAlign utilizes BERTScore between sampled generations and high-quality reference answers as the surrogate reward. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, RefAlign demonstrates comparable performance to previous alignment methods while offering high efficiency.

Title: TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models

Authors: Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09896
Pdf URL: https://arxiv.org/pdf/2504.09896
Copy Paste: [[2504.09896]] TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models(https://arxiv.org/abs/2504.09896)
Keywords: gpt
Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We apply text cleaning, tokenization, and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW) to ensure high-quality input data for the models. The hybrid approach was evaluated on the benchmark datasets Sentiment140 and IMDB, achieving accuracy rates of 94% and 95%, respectively, and outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights the framework's applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking, offering a pathway for future advancements in hybrid NLP frameworks.
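Illustration (not from the paper): the abstract above mentions Bag of Words and TF-IDF feature extraction without giving details. The sketch below shows the textbook definitions on toy data, assuming whitespace tokenization, tf = count / document length, and idf = log(N / df) with no smoothing; the paper's actual preprocessing may differ, and the function names are chosen for this example only.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Bag of Words: raw term counts for a single document."""
    return Counter(tokens)

def tf_idf(docs):
    """docs is a list of token lists. Returns one {term: weight} dict per
    document, with tf = count / len(doc) and idf = log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weighted

texts = ["great movie great cast", "terrible movie", "great soundtrack"]
docs = [t.lower().split() for t in texts]

bow = bag_of_words(docs[0])
weights = tf_idf(docs)
```

A term that occurs in every document gets weight zero under this unsmoothed idf, which is why library implementations usually add smoothing.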

Title: Refining Financial Consumer Complaints through Multi-Scale Model Interaction

Authors: Bo-Wei Chen, An-Zi Yen, Chung-Chi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09903
Pdf URL: https://arxiv.org/pdf/2504.09903
Copy Paste: [[2504.09903]] Refining Financial Consumer Complaints through Multi-Scale Model Interaction(https://arxiv.org/abs/2504.09903)
Keywords: language model, llm, prompt
Abstract: Legal writing demands clarity, formality, and domain-specific precision, qualities often lacking in documents authored by individuals without legal training. To bridge this gap, this paper explores the task of legal text refinement that transforms informal, conversational inputs into persuasive legal arguments. We introduce FinDR, a Chinese dataset of financial dispute records, annotated with official judgments on claim reasonableness. Our proposed method, Multi-Scale Model Interaction (MSMI), leverages a lightweight classifier to evaluate outputs and guide iterative refinement by Large Language Models (LLMs). Experimental results demonstrate that MSMI significantly outperforms single-pass prompting strategies. Additionally, we validate the generalizability of MSMI on several short-text benchmarks, showing improved adversarial robustness. Our findings reveal the potential of multi-model collaboration for enhancing legal document generation and broader text refinement tasks.

Title: Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models

Authors: Yujing Wang, Hainan Zhang, Liang Pang, Yongxin Tong, Binghui Guo, Hongwei Zheng, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09910
Pdf URL: https://arxiv.org/pdf/2504.09910
Copy Paste: [[2504.09910]] Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2504.09910)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenarios, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, PPO algorithm optimizes the rewriting model to minimize private triples and maximize public triples retention. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erase performance than GPT-4o.

Title: Guiding Reasoning in Small Language Models with LLM Assistance

Authors: Yujin Kim, Euiin Yi, Minu Kim, Se-Young Yun, Taehyeon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09923
Pdf URL: https://arxiv.org/pdf/2504.09923
Copy Paste: [[2504.09923]] Guiding Reasoning in Small Language Models with LLM Assistance(https://arxiv.org/abs/2504.09923)
Keywords: language model, llm
Abstract: The limited reasoning capabilities of small language models (SLMs) cast doubt on their suitability for tasks demanding deep, multi-step logical deduction. This paper introduces a framework called Small Reasons, Large Hints (SMART), which selectively augments SLM reasoning with targeted guidance from large language models (LLMs). Inspired by the concept of cognitive scaffolding, SMART employs a score-based evaluation to identify uncertain reasoning steps and injects corrective LLM-generated reasoning only when necessary. By framing structured reasoning as an optimal policy search, our approach steers the reasoning trajectory toward correct solutions without exhaustive sampling. Our experiments on mathematical reasoning datasets demonstrate that targeted external scaffolding significantly improves performance, paving the way for collaborative use of both SLM and LLM to tackle complex reasoning tasks that are currently unsolvable by SLMs alone.

Title: C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset

Authors: Fuqiang Niu, Yi Yang, Xianghua Fu, Genan Dai, Bowen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09958
Pdf URL: https://arxiv.org/pdf/2504.09958
Copy Paste: [[2504.09958]] C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset(https://arxiv.org/abs/2504.09958)
Keywords: language model
Abstract: Stance detection has become an essential tool for analyzing public discussions on social media. Current methods face significant challenges, particularly in Chinese language processing and multi-turn conversational analysis. To address these limitations, we introduce C-MTCSD, the largest Chinese multi-turn conversational stance detection dataset, comprising 24,264 carefully annotated instances from Sina Weibo, which is 4.2 times larger than the only prior Chinese conversational stance detection dataset. Our comprehensive evaluation using both traditional approaches and large language models reveals the complexity of C-MTCSD: even state-of-the-art models achieve only 64.07% F1 score in the challenging zero-shot setting, while performance consistently degrades with increasing conversation depth. Traditional models particularly struggle with implicit stance detection, achieving below 50% F1 score. This work establishes a challenging new benchmark for Chinese stance detection research, highlighting significant opportunities for future improvements.

Title: The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination

Authors: Hao Yin, Gunagzong Si, Zilei Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10020
Pdf URL: https://arxiv.org/pdf/2504.10020
Copy Paste: [[2504.10020]] The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination(https://arxiv.org/abs/2504.10020)
Keywords: language model, llm, hallucination
Abstract: Contrastive decoding strategies are widely used to reduce hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
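Illustration (not from the paper): the abstract above argues that the adaptive plausibility constraint can reduce sampling to greedy search. The sketch below uses a commonly seen form of contrastive decoding, (1 + alpha) * original logits minus alpha * contrastive logits, restricted to tokens whose original probability is at least beta times the maximum. The exact formulation evaluated in the paper is not stated in the abstract, so the formula, the toy logits, and all names here are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(orig, contrast, alpha=1.0, beta=0.1):
    """Contrast two logit vectors, masking implausible tokens to -inf.
    A token is plausible if its original probability is at least
    beta * (maximum original probability)."""
    p = softmax(orig)
    cutoff = beta * max(p)
    return [
        (1 + alpha) * o - alpha * c if pi >= cutoff else -math.inf
        for o, c, pi in zip(orig, contrast, p)
    ]

# Toy next-token logits over a 4-token vocabulary.
orig = [2.0, 1.5, -1.0, -3.0]      # model conditioned on the real input
contrast = [1.0, 2.0, -1.0, 0.0]   # model conditioned on a distorted input

loose = contrastive_logits(orig, contrast, beta=0.1)
strict = contrastive_logits(orig, contrast, beta=0.9)
survivors = [i for i, x in enumerate(strict) if x > -math.inf]
```

With beta = 0.9 only the original argmax token survives the mask, so any sampling scheme applied afterwards must pick it, which is the greedy-search effect the abstract describes.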

Title: DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify

Authors: Zhengxuan Zhang, Zhuowen Liang, Yin Wu, Teng Lin, Yuyu Luo, Nan Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10036
Pdf URL: https://arxiv.org/pdf/2504.10036
Copy Paste: [[2504.10036]] DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify(https://arxiv.org/abs/2504.10036)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) are transforming data analytics, but their widespread adoption is hindered by two critical limitations: they are not explainable (opaque reasoning processes) and not verifiable (prone to hallucinations and unchecked errors). While retrieval-augmented generation (RAG) improves accuracy by grounding LLMs in external data, it fails to address the core challenges of trustworthy analytics - especially when processing noisy, inconsistent, or multi-modal data (for example, text, tables, images). We propose DataMosaic, a framework designed to make LLM-powered analytics both explainable and verifiable. By dynamically extracting task-specific structures (for example, tables, graphs, trees) from raw data, DataMosaic provides transparent, step-by-step reasoning traces and enables validation of intermediate results. Built on a multi-agent framework, DataMosaic orchestrates self-adaptive agents that align with downstream task requirements, enhancing consistency, completeness, and privacy. Through this approach, DataMosaic not only tackles the limitations of current LLM-powered analytics systems but also lays the groundwork for a new paradigm of grounded, accurate, and explainable multi-modal data analytics.
摘要：大型语言模型（LLMS）正在转换数据分析，但是它们的广泛采用受到两个关键局限性的阻碍：它们不能解释（不透明的推理过程），而不是可验证的（容易遭受幻觉和未经检查的错误）。虽然检索增强的生成（RAG）通过将LLMS接地在外部数据中提高了准确性，但它无法解决可信赖分析的核心挑战 - 尤其是在处理嘈杂，不一致或多模式数据时（例如，文本，表格，图像，图像）。我们提出了DataMosaic，这是一个旨在使LLM驱动分析均可解释和可验证的框架。通过从原始数据中动态提取特定于任务的结构（例如表，图形，树），DataMosaic提供了透明的，分步的推理跟踪，并可以验证中间结果。 Datamosaic建立在多代理框架的基础上，协调与下游任务要求相符的自适应代理，增强一致性，完整性和隐私。通过这种方法，Datamosaic不仅应对当前LLM供电的分析系统的局限性，而且为新的接地，准确且可解释的多模式数据分析的新范式奠定了基础。

Title: Hallucination Detection in LLMs via Topological Divergence on Attention Graphs

Authors: Alexandra Bazarova, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Andrei Volodichev, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10063
Pdf URL: https://arxiv.org/pdf/2504.10063
Copy Paste: [[2504.10063]] Hallucination Detection in LLMs via Topological Divergence on Attention Graphs(https://arxiv.org/abs/2504.10063)
Keywords: language model, llm, hallucination, prompt
Abstract: Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments, including evaluation on question answering and data-to-text tasks, show that our approach achieves state-of-the-art or competitive results on several benchmarks, two of which were annotated by us and are being publicly released to facilitate further research. Beyond its strong in-domain performance, TOHA maintains remarkable domain transferability across multiple open-source LLMs. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.
摘要：幻觉，即产生事实不正确的内容，仍然是大型语言模型（LLMS）的关键挑战。我们介绍了TOHA，这是一个基于拓扑的幻觉检测器，在抹布环境中，它利用拓扑差异度量来量化注意矩阵引起的图形的结构特性。检查及时和响应子图之间的拓扑差异揭示了一致的模式：特定注意力头的较高差异值与幻觉输出相关，而与数据集无关。广泛的实验，包括对问答的评估和数据到文本任务，表明我们的方法在几个基准上取得了最新的或竞争成果，其中两个由我们注释，并正在公开发布，以促进进一步的研究。除了其强大的内域性能外，Toha还保持了多个开源LLM的显着域可转移性。我们的发现表明，分析注意力矩阵的拓扑结构可以作为LLMS事实可靠性的有效且可靠的指标。

Title: Towards Quantifying Commonsense Reasoning with Mechanistic Insights

Authors: Abhinav Joshi, Areeb Ahmad, Divyaksh Shukla, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10077
Pdf URL: https://arxiv.org/pdf/2504.10077
Copy Paste: [[2504.10077]] Towards Quantifying Commonsense Reasoning with Mechanistic Insights(https://arxiv.org/abs/2504.10077)
Keywords: llm, prompt
Abstract: Commonsense reasoning deals with the implicit knowledge that is well understood by humans and typically acquired via interactions with the world. In recent times, commonsense reasoning and understanding of various LLMs have been evaluated using text-based tasks. In this work, we argue that a proxy of this understanding can be maintained as a graphical structure that can further help to perform a rigorous evaluation of commonsense reasoning abilities about various real-world activities. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries (~ 10^{17}), facilitating rigorous evaluation of commonsense reasoning in LLMs. Moreover, recently, the remarkable performance of LLMs has raised questions about whether these models are truly capable of reasoning in the wild and, in general, how reasoning occurs inside these models. In this resource paper, we bridge this gap by proposing design mechanisms that facilitate research in a similar direction. Our findings suggest that the reasoning components that play a prominent role in decision-making when the model is prompted with a commonsense query are localized within LLMs.
摘要：常识性推理涉及人类对人类充分理解的隐性知识，通常是通过与世界互动获得的。最近，已经使用基于文本的任务评估了对各种LLM的常识性推理和理解。在这项工作中，我们认为，可以将这种理解的代表保持为图形结构，可以进一步帮助对各种现实世界活动的常识性推理能力进行严格的评估。我们创建了一个注释方案，用于以37个日常活动的图形结构的形式捕获这种隐性知识。我们发现，创建的资源可用于构建大量常识性查询（〜10^{17}），从而促进了LLMS中常分推理的严格评估。此外，最近，LLM的出色表现提出了有关这些模型是否真正能够在野外推理的问题，总的来说，推理是如何在这些模型中发生的。在此资源文件中，我们通过提出促进类似方向研究的设计机制来弥合这一差距。我们的发现表明，推理组件位于LLM中，在提示使用常识性查询时，在决策中起着重要作用。

Title: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Authors: Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Jingxuan Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.10157
Pdf URL: https://arxiv.org/pdf/2504.10157
Copy Paste: [[2504.10157]] SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users(https://arxiv.org/abs/2504.10157)
Keywords: language model, llm, agent
Abstract: Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.
摘要：社会模拟正在通过通过虚拟个人及其环境之间的互动来建模人类行为来改变传统的社会科学研究。随着大语言模型（LLM）的最新进展，这种方法表明，在捕获个体差异和预测群体行为方面的潜力越来越大。但是，现有方法面临与环境，目标用户，互动机制和行为模式相关的一致性挑战。为此，我们介绍了社会模拟社会驱动的世界驱动世界模型社会versevers。我们的框架具有四个强大的对齐组件和一个1000万个真实个人的用户池。为了验证其有效性，我们在三个不同的领域进行了大规模的模拟实验：政治，新闻和经济学。结果表明，社会诉讼可以反映大规模的人口动态，同时通过标准化程序和最少的手动调整来确保多样性，信誉和代表性。

Title: MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning

Authors: Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10160
Pdf URL: https://arxiv.org/pdf/2504.10160
Copy Paste: [[2504.10160]] MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning(https://arxiv.org/abs/2504.10160)
Keywords: language model, gpt, llm
Abstract: Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at this https URL.
摘要：大规模加强学习（RL）方法已被证明在增强大语言模型（LLM）的推理能力方面非常有效，尤其是对于具有可验证解决方案（例如数学和编码）的任务。但是，将此想法应用于机器翻译（MT），其中输出的格式灵活并且难以自动使用明确的规则自动评估，但仍未得到充实。在这项工作中，我们介绍了MT-R1-Zero，这是MT R1-Zero RL框架的第一个开源改编，而无需监督微调或冷启动。我们提出了一种规则的混合奖励机制，可以通过紧急推理指导LLMS提高翻译质量。在WMT 24英语基准测试中，我们的MT-R1-Zero-3b-Mix取得了竞争性能，超过Towerinstruct-7b-v0.2的平均得分为1.26点。同时，我们的MT-R1-Zero-7b-Mix在所有指标中达到62.25的高平均得分，与高级专有模型（如GPT-4O和Claude-3.5-Sonnet）相当，而MT-R1-R1-Zero-Zero-Zero-7b-7b-Sem-Sem Achieves在语义上的量表上是态度的。此外，我们的工作在分布的MT任务上表现出强大的概括能力，可以强大地支持多语言和低资源设置。对不同初始化和奖励指标跨模型行为的广泛分析为奖励设计，LLM适应性，训练动力学和新兴推理模式的关键作用提供了先驱见解。我们的代码可在此HTTPS URL上找到。

Title: C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

Authors: Xu Zhang, Zhifei Liu, Jiahao Wang, Huixuan Zhang, Fan Xu, Junzhe Zhang, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10167
Pdf URL: https://arxiv.org/pdf/2504.10167
Copy Paste: [[2504.10167]] C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation(https://arxiv.org/abs/2504.10167)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in the Chinese language) rely on human annotations, making automatic and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs a fine-grained QA dataset based on knowledge documents. Our experiments demonstrate that manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.
摘要：尽管大型语言模型的发展迅速，但它们仍然非常容易受到幻觉的影响，从而极大地阻碍了他们的广泛应用。幻觉研究需要动态和细粒度的评估。但是，大多数现有的幻觉基准（尤其是中文）都依赖人类注释，从而使自动且具有成本效益的幻觉评估具有挑战性。为了解决这个问题，我们介绍了Haluagent，这是一个代理框架，该框架会根据某些知识文档自动构建细粒度的QA数据集。我们的实验表明，手动设计的规则和及时优化可以提高生成数据的质量。使用Haluagent，我们构建了C-Faith，这是一种中国质量检查幻觉基准，该基准是由1,399个知识文件创建的，总计60,702个条目。我们通过建议的C-Faith全面评估16个主流LLM，提供详细的实验结果和分析。

Title: HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection

Authors: Mohamed A. Abdallah, Samhaa R. El-Beltagy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10168
Pdf URL: https://arxiv.org/pdf/2504.10168
Copy Paste: [[2504.10168]] HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection(https://arxiv.org/abs/2504.10168)
Keywords: language model, llm, hallucination
Abstract: In this paper, we present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. Developed as part of Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in fourteen different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech. While the system's retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.
摘要：在本文中，我们提出了Hallusearch，这是一条多语言管道，旨在检测大型语言模型（LLM）输出的捏造文本跨度。 Hallusearch夫妇以幻觉和相关可观察到的过度误差的多种语言共享任务开发，该任务是通过良好的事实分开的，以识别和本地化14种不同语言的幻觉。经验评估表明，Hallusearch的表现竞争性，在英语中排名第四（在十大百分比之内）和捷克语。尽管该系统的基于检索的策略通常证明是强大的，但它在在线覆盖范围有限的语言中面临挑战，强调了进一步研究的需求，以确保跨不同语言环境的持续幻觉检测。

Title: LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

Authors: Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10185
Pdf URL: https://arxiv.org/pdf/2504.10185
Copy Paste: [[2504.10185]] LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks(https://arxiv.org/abs/2504.10185)
Keywords: language model, llm
Abstract: Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at this https URL.
摘要：大型语言模型的学习已成为确保安全和受控模型行为的关键挑战，通过在保留通用公用事业的同时，从预算模型中删除不希望的数据模型。最近的重大努力致力于开发LLM学习的基准，例如WMDP（大规模杀伤性武器）和Muse（机器六向评估），促进了标准化的不学习绩效评估和方法比较。尽管它们有用，但我们首次在这些基准中发现了新的核心效应。具体而言，我们发现使用原始（完整）忘记集获得的LLM可以使用明显较小的子集有效地维护（作为“核心”），例如，即使在随机选择的情况下，也只需少于忘记集的5％。这表明，即使在极低的DATA制度中，这些基准测试中的LLM在这些基准测试中也可以令人惊讶地执行。我们证明，无论使用的LLM未学习方法如何，NPO（负偏好优化）和RMU（表示误导性误导），这些核心效应仍然很强。在各种数据选择方法中，出乎意料的强核效应也很强，从随机选择到更复杂的启发式方法。我们通过基于关键字的视角解释了LLM学习的核心效应，这表明仅从《忘记》集中提取的关键字对未学习的效率显着贡献，并表明当前的未学习是由一组紧凑的高损坏令牌驱动的，而不是整个数据集。我们进一步证明了沿其他维度的核心未清除模型的忠诚，例如模式的连接性和对越狱攻击的稳健性。代码可在此HTTPS URL上找到。
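
The random-selection setting mentioned in this abstract (a subset as small as 5% of the forget set) can be sketched as follows. This is only an illustration of drawing such a subset; the function name `sample_coreset` and its parameters are hypothetical, and the unlearning step itself (for example NPO or RMU) is not shown.

```python
import random

def sample_coreset(forget_set, fraction=0.05, seed=0):
    """Draw a random subset of the forget set without replacement.

    fraction is the share of examples to keep; seed makes the draw
    reproducible. At least one example is always returned.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(forget_set) * fraction))
    rng = random.Random(seed)
    return rng.sample(forget_set, k)

# Toy forget set of 200 example ids (illustrative only).
forget_set = list(range(200))
coreset = sample_coreset(forget_set, fraction=0.05, seed=0)
```

The resulting `coreset` would then be passed to the unlearning method in place of the full forget set.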

Title: Deep Reasoning Translation via Reinforcement Learning

Authors: Jiaan Wang, Fandong Meng, Jie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10187
Pdf URL: https://arxiv.org/pdf/2504.10187
Copy Paste: [[2504.10187]] Deep Reasoning Translation via Reinforcement Learning(https://arxiv.org/abs/2504.10187)
Keywords: llm
Abstract: Recently, deep reasoning LLMs (e.g., OpenAI o1/o3 and DeepSeek-R1) have shown promising performance in various complex tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation and taking cultural differences into account. This task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning. Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought process. Given the source sentences, the reward model teaches the deep translation model how to think and free-translate them during reinforcement learning. In this way, training DeepTrans does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning baselines as well as baselines that are fine-tuned with synthesized data. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.
摘要：最近，深层推理LLM（例如OpenAI O1/O3和DeepSeek-R1）在各种复杂的任务中表现出了令人鼓舞的表现。自由翻译是多语言世界中一项重要而有趣的任务，它需要超越单词的翻译并考虑到文化差异。在深度推理LLM中，此任务仍未探索。在本文中，我们介绍了深层Trans，这是一种深层的推理翻译模型，该模型通过强化学习来学习自由翻译。具体来说，我们在翻译结果和思考过程中仔细构建了一个具有预定义评分标准的奖励模型。鉴于源句子，奖励模型教授了深层翻译模型如何在加固学习过程中思考和自由翻译它们。通过这种方式，训练深层不需要任何标记的翻译，避免了人类密集型注释或资源密集型数据综合。实验结果表明深晶型的有效性。 DeepTrans使用QWEN2.5-7B作为骨干，在文献翻译中提高了16.3％的性能，并且胜过强烈的深层推理基准以及通过合成数据进行微调的基线。此外，我们总结了RL探索期间的失败和有趣的发现。我们希望这项工作能够激发其他研究人员的自由翻译。

Title: Localized Cultural Knowledge is Conserved and Controllable in Large Language Models

Authors: Veniamin Veselovsky, Berke Argin, Benedikt Stroebl, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10191
Pdf URL: https://arxiv.org/pdf/2504.10191
Copy Paste: [[2504.10191]] Localized Cultural Knowledge is Conserved and Controllable in Large Language Models(https://arxiv.org/abs/2504.10191)
Keywords: language model, llm, prompt
Abstract: Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English-centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit-implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non-English languages we explore, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.
摘要：正如人类在说新语言时表现出受母语影响的语言模式一样，LLMS也经常默认为以英语为中心的响应，即使在其他语言中产生。然而，我们观察到，当地的文化信息持续存在于模型中，并且可以很容易被激活以进行文化定制。我们首先证明，在提示中明确提供文化背景可以显着提高模型产生文化本地化反应的能力。我们将模型性能的差异与没有明确的文化背景相比，明确的定位差距，表明尽管LLM内存在文化知识，但如果未提供文化背景，它可能不会自然地在多语言互动中浮出水面。但是，尽管有明确的提示，但答案的多样性减少了，并且倾向于刻板印象。其次，我们确定了我们探索的所有非英语语言保守的明确的文化定制向量，这使LLM可以从合成的英国文化世界模型转向每个非英国文化世界。转向的响应保留了隐性提示的多样性，并减少了刻板印象，以显着提高定制的潜力。我们讨论了明确的文化定制对理解LLM中替代文化世界模型的保护的含义，以及它们可控的效用，对翻译，文化定制以及通过软控制LLM功能和吸引力而明确隐含的可能性。

Title: DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation

Authors: Hanghui Guo, Jia Zhu, Shimin Di, Weijie Shi, Zhangze Chen, Jiajie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10198
Pdf URL: https://arxiv.org/pdf/2504.10198
Copy Paste: [[2504.10198]] DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation(https://arxiv.org/abs/2504.10198)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Dynamic Retrieval-augmented Generation (RAG) has shown great success in mitigating hallucinations in large language models (LLMs) during generation. However, existing dynamic RAG methods face significant limitations in two key aspects: 1) Lack of an effective mechanism to control retrieval triggers, and 2) Lack of effective scrutiny of retrieval content. To address these limitations, we propose an innovative dynamic RAG method, DioR (Adaptive Cognitive Detection and Contextual Retrieval Optimization), which consists of two main components: adaptive cognitive detection and contextual retrieval optimization, designed to determine when retrieval is needed and which retrieved content is useful for LLMs. Experimental results demonstrate that DioR achieves superior performance on all tasks, demonstrating the effectiveness of our work.
摘要：动态检索增强的生成（RAG）在生成期间在缓解大语言模型（LLM）中的幻觉方面取得了巨大的成功。但是，现有的动态抹布方法在两个关键方面面临着重大局限性：1）缺乏控制检索触发器的有效机制，以及2）缺乏对检索含量的有效审查。为了解决这些局限性，我们提出了一种创新的动态抹布方法，DIOR（自适应认知检测和上下文检索优化），该方法由两个主要组成部分组成：自适应认知检测和上下文检索优化，专为确定何时需要检索以及为LLMS回收的内容是有用的。实验结果表明，Dior在所有任务上都取得了卓越的表现，证明了我们工作的有效性。

Title: Probing then Editing Response Personality of Large Language Models

Authors: Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10227
Pdf URL: https://arxiv.org/pdf/2504.10227
Copy Paste: [[2504.10227]] Probing then Editing Response Personality of Large Language Models(https://arxiv.org/abs/2504.10227)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that exhibit consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in encoding personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly encode personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at this https URL.
摘要：大型语言模型（LLM）表现出有希望的能力，可以产生表现出一致的人格特征的响应。尽管主要尝试通过基于输出的评估来分析人格表达，但对于在LLM参数中如何内部编码这些特征的知识知之甚少。在本文中，我们介绍了一个层面探测框架，以系统地研究LLM在编码人格响应中的层面层次能力。我们对人格基准的11个开源LLM进行了探测实验，发现LLMS主要编码个性以在其中层和上层响应，并具有指导性调整的模型，表明人格特质的分离稍明确。此外，通过将训练的探测超平面解释为每个人格类别的层边界，我们提出了一种层面的扰动方法来编辑推理过程中LLMS表达的人格。我们的结果表明，即使提示明确指定特定个性，我们的方法仍然可以成功改变LLM的响应人格。有趣的是，在某些人格特征之间转换的困难有很大的不同，这与我们的探测实验中的代表性距离保持一致。最后，我们进行了全面的MMLU基准评估和时间间接费用分析，这表明我们提出的个性编辑方法仅在一般能力中最小降级，同时保持低培训成本和可接受的推断潜伏期。我们的代码在此HTTPS URL上公开可用。
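
The idea of treating a trained probing hyperplane as a boundary and perturbing a hidden state across it can be sketched in a few lines. This is a minimal geometric illustration under stated assumptions, not the paper's method: a linear probe with weights `w` and bias `b` is assumed, and the function name, the `margin` parameter, and the minimal-norm update rule are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perturb_across_hyperplane(h, w, b, margin=0.5):
    """Move hidden state h to the target side of the hyperplane w.x + b = 0.

    If the probe score is already at least `margin`, h is returned unchanged.
    Otherwise h is shifted along w by the smallest amount that makes the
    score equal to `margin`.
    """
    score = dot(w, h) + b
    if score >= margin:
        return list(h)
    step = (margin - score) / dot(w, w)
    return [hi + step * wi for hi, wi in zip(h, w)]

# Toy 2-dimensional example (values are illustrative only).
h = [1.0, 0.0]
w = [0.0, 1.0]
b = -1.0
h_edited = perturb_across_hyperplane(h, w, b, margin=0.5)
new_score = dot(w, h_edited) + b
```

Shifting along `w` changes the probe score while leaving the components of `h` orthogonal to `w` untouched, which is one plausible reason such an edit could limit side effects on unrelated behavior.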

Title: Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

Authors: Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10284
Pdf URL: https://arxiv.org/pdf/2504.10284
Copy Paste: [[2504.10284]] Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol(https://arxiv.org/abs/2504.10284)
Keywords: llm, prompt
Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at this https URL.
摘要：文献综述表对于总结和比较科学论文的集合至关重要。我们探讨了生成最能满足用户信息需求的表的任务，鉴于一系列科学论文。在最近的工作（Newman等，2024）的基础上，我们扩展了通过基于LLM的方法和人类注释的结合来解决现实世界复杂性的先前方法。我们的贡献着重于现实世界中遇到的三个关键挑战：（i）用户提示通常不明显；（ii）检索的候选论文经常包含无关紧要的内容；（iii）任务评估应超越浅文本相似性技术，而是评估了推断表的效用，以寻求信息的任务（例如，比较论文）。为了支持可重复的评估，我们介绍了Arxiv2table，这是该任务的更现实，更具挑战性的基准，以及一种新颖的方法，用于改善现实世界中的文献综述餐桌生成。我们对这个基准测试的广泛实验表明，开放权重和专有的LLM都在努力工作，强调了它的困难和进一步的进步。我们的数据集和代码可在此HTTPS URL上找到。

Title: MorphTok: Morphologically Grounded Tokenization for Indian Languages

Authors: Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10335
Pdf URL: https://arxiv.org/pdf/2504.10335
Copy Paste: [[2504.10335]] MorphTok: Morphologically Grounded Tokenization for Indian Languages(https://arxiv.org/abs/2504.10335)
Keywords: language model, llm
Abstract: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, EvalTok, enabling more human-grounded assessment.
摘要：令牌化是NLP的关键步骤，尤其是在大型语言模型（LLM）的兴起，影响下游性能，计算成本和效率。现有的LLM依靠经典的字节对编码（BPE）算法来贪婪地合并频繁的字符bigrams的子字令牌。这通常会导致分割，而这种细分与语言上有意义的单位不符。为了解决这个问题，我们建议在应用BPE之前，将形态感知的分割作为预言式步骤。为了促进形态感知的分割，我们为印地语和马拉地语创建了一个新颖的数据集，并结合了Sandhi拆分以增强子单词令牌化。下游任务的实验表明，形态上扎根的令牌化改善了机器翻译和语言建模的性能。此外，为了处理NiCode字符中的歧义，尤其是基于音节的写作系统中的依赖元音，我们引入了受限的BPE（CBPE），这是对传统BPE算法的扩展，该算法包含了脚本特定约束。具体而言，CBPE处理依赖元音。我们的结果表明，CBPE的生育能力得分降低了1.68 \％，同时保持机器翻译中的可比或改进的下游性能，从而提供了标准BPE的计算有效替代方案。此外，为了评估跨不同令牌化算法的细分，我们引入了一个新的人类评估指标，\ textit {evaltok}，从而实现了更多的人为基础的评估。
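
The greedy bigram merging and the fertility score referred to in this abstract can be sketched as below. This is plain BPE on a toy English example, not CBPE or the authors' pipeline; the input `segments` stands in for units produced by a pre-tokenization step (a morphological segmenter is assumed but not implemented), and fertility is taken here as the average number of tokens per segment, which is an assumption about the exact definition used.

```python
from collections import Counter

def most_frequent_pair(sequences):
    # Count adjacent symbol pairs within each sequence (never across sequences).
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    # Replace every non-overlapping occurrence of `pair` with the merged symbol.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(segments, num_merges):
    # Start from characters and greedily merge the most frequent pair.
    sequences = [list(s) for s in segments]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(s, pair) for s in sequences]
    return merges, sequences

def fertility(tokenized):
    # Average number of tokens per input segment.
    return sum(len(t) for t in tokenized) / len(tokenized)

segments = ["low", "lower", "lowest"]
merges, tokenized = train_bpe(segments, num_merges=2)
score = fertility(tokenized)
```

Because pairs are only counted inside each segment, merges cannot cross the boundaries introduced by pre-tokenization, which is the property a morphology-aware pre-tokenizer relies on.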

Title: Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families

Authors: Shahriar Noroozizadeh, Sayantan Kumar, Jeremy C. Weiss
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10340
Pdf URL: https://arxiv.org/pdf/2504.10340
Copy Paste: [[2504.10340]] Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families(https://arxiv.org/abs/2504.10340)
Keywords: language model, llm
Abstract: Clinical case reports encode rich, temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings--extracted via an LLM-assisted annotation pipeline--serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

Title: VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Authors: Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10342
Pdf URL: https://arxiv.org/pdf/2504.10342
Copy Paste: [[2504.10342]] VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge(https://arxiv.org/abs/2504.10342)
Keywords: language model
Abstract: Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

Title: MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

Authors: Dieuwke Hupkes, Nikolay Bogoychev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10356
Pdf URL: https://arxiv.org/pdf/2504.10356
Copy Paste: [[2504.10356]] MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages(https://arxiv.org/abs/2504.10356)
Keywords: llm, chat
Abstract: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences of more than 20 points for the best performing models and drastically change the estimated difficulty of some languages. When using machine instead of human translations, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.

Title: DICE: A Framework for Dimensional and Contextual Evaluation of Language Models

Authors: Aryan Shrivastava, Paula Akemi Aoyagui
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10359
Pdf URL: https://arxiv.org/pdf/2504.10359
Copy Paste: [[2504.10359]] DICE: A Framework for Dimensional and Contextual Evaluation of Language Models(https://arxiv.org/abs/2504.10359)
Keywords: language model
Abstract: Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.

Title: S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Authors: Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10368
Pdf URL: https://arxiv.org/pdf/2504.10368
Copy Paste: [[2504.10368]] S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models(https://arxiv.org/abs/2504.10368)
Keywords: llm
Abstract: We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models' (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their reliance on deep analytical thinking may limit their system 1 thinking capabilities. Moreover, there is currently a lack of benchmarks for evaluating LRMs' performance on tasks that require such capabilities. To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions across multiple domains and languages, specifically designed to assess LRMs' performance on such tasks. Our comprehensive evaluation of 22 LRMs reveals a tendency toward significantly lower efficiency, with outputs averaging 15.5 times longer than those of traditional small LLMs. Additionally, LRMs often identify correct answers early but continue unnecessary deliberation, with some models even producing numerous errors. These findings highlight the rigid reasoning patterns of current LRMs and underscore the substantial development needed to achieve balanced dual-system thinking capabilities that can adapt appropriately to task complexity.

Title: LLM-driven Constrained Copy Generation through Iterative Refinement

Authors: Varun Vasudevan, Faezeh Akhavizadegan, Abhinav Prakash, Yokila Arora, Jason Cho, Tanya Mendiratta, Sushant Kumar, Kannan Achan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10391
Pdf URL: https://arxiv.org/pdf/2504.10391
Copy Paste: [[2504.10391]] LLM-driven Constrained Copy Generation through Iterative Refinement(https://arxiv.org/abs/2504.10391)
Keywords: llm
Abstract: Crafting a marketing message (copy), or copywriting, is a challenging generation task, as the copy must adhere to various constraints. Copy creation is inherently iterative for humans, starting with an initial draft followed by successive refinements. However, manual copy creation is time-consuming and expensive, resulting in only a few copies for each use case. This limitation restricts our ability to personalize content to customers. Contrary to the manual approach, LLMs can generate copies quickly, but the generated content does not consistently meet all the constraints on the first attempt (similar to humans). While recent studies have shown promise in improving constrained generation through iterative refinement, they have primarily addressed tasks with only a few simple constraints. Consequently, the effectiveness of iterative refinement for tasks such as copy generation, which involves many intricate constraints, remains unclear. To address this gap, we propose an LLM-based end-to-end framework for scalable copy generation using iterative refinement. To the best of our knowledge, this is the first study to address multiple challenging constraints simultaneously in copy generation. Examples of these constraints include length, topics, keywords, preferred lexical ordering, and tone of voice. We demonstrate the performance of our framework by creating copies for e-commerce banners for three different use cases of varying complexity. Our results show that iterative refinement increases the copy success rate by 16.25-35.91% across use cases. Furthermore, the copies generated using our approach outperformed manually created content in multiple pilot studies using a multi-armed bandit framework. The winning copy improved the click-through rate by 38.5-45.21%.

Title: Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

Authors: Diogo Sousa, Guilherme Barbosa, Catarina Rocha, Dulce Oliveira
Subjects: cs.CL, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10405
Pdf URL: https://arxiv.org/pdf/2504.10405
Copy Paste: [[2504.10405]] Performance of Large Language Models in Supporting Medical Diagnosis and Treatment(https://arxiv.org/abs/2504.10405)
Keywords: language model, llm, chain-of-thought
Abstract: The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.

Title: LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Authors: Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10415
Pdf URL: https://arxiv.org/pdf/2504.10415
Copy Paste: [[2504.10415]] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models(https://arxiv.org/abs/2504.10415)
Keywords: language model, llm
Abstract: Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.

Title: CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Authors: Jing Chen, Zhihua Wei, Wei Zhang, Yingying Hu, Qiong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10418
Pdf URL: https://arxiv.org/pdf/2504.10418
Copy Paste: [[2504.10418]] CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation(https://arxiv.org/abs/2504.10418)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM's interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.

Title: Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA

Authors: Michał Turski, Mateusz Chiliński, Łukasz Borchmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10419
Pdf URL: https://arxiv.org/pdf/2504.10419
Copy Paste: [[2504.10419]] Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA(https://arxiv.org/abs/2504.10419)
Keywords: language model
Abstract: Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: this https URL

Title: Can We Edit LLMs for Long-Tail Biomedical Knowledge?

Authors: Xinhao Yi, Jake Lever, Kevin Bryson, Zaiqiao Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10421
Pdf URL: https://arxiv.org/pdf/2504.10421
Copy Paste: [[2504.10421]] Can We Edit LLMs for Long-Tail Biomedical Knowledge?(https://arxiv.org/abs/2504.10421)
Keywords: language model, llm
Abstract: Knowledge editing has emerged as an effective approach for updating large language models (LLMs) by modifying their internal knowledge. However, their application to the biomedical domain faces unique challenges due to the long-tailed distribution of biomedical knowledge, where rare and infrequent information is prevalent. In this paper, we conduct the first comprehensive study to investigate the effectiveness of knowledge editing methods for editing long-tail biomedical knowledge. Our results indicate that, while existing editing methods can enhance LLMs' performance on long-tail biomedical knowledge, their performance on long-tail knowledge remains inferior to that on high-frequency popular knowledge, even after editing. Our further analysis reveals that long-tail biomedical knowledge contains a significant amount of one-to-many knowledge, where one subject and relation link to multiple objects. This high prevalence of one-to-many knowledge limits the effectiveness of knowledge editing in improving LLMs' understanding of long-tail biomedical knowledge, highlighting the need for tailored strategies to bridge this performance gap.

Title: LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

Authors: Minqian Liu, Zhiyang Xu, Xinyi Zhang, Heajun An, Sarvech Qadir, Qi Zhang, Pamela J. Wisniewski, Jin-Hee Cho, Sang Won Lee, Ruoxi Jia, Lifu Huang
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10430
Pdf URL: https://arxiv.org/pdf/2504.10430
Copy Paste: [[2504.10430]] LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models(https://arxiv.org/abs/2504.10430)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.

Title: xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Authors: Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10481
Pdf URL: https://arxiv.org/pdf/2504.10481
Copy Paste: [[2504.10481]] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations(https://arxiv.org/abs/2504.10481)
Keywords: gpt, llm
Abstract: With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.