2025-06-04

Title: Research on Medical Named Entity Identification Based On Prompt-Biomrc Model and Its Application in Intelligent Consultation System

Authors: Jinzhu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01961
Pdf URL: https://arxiv.org/pdf/2506.01961
Copy Paste: [[2506.01961]] Research on Medical Named Entity Identification Based On Prompt-Biomrc Model and Its Application in Intelligent Consultation System(https://arxiv.org/abs/2506.01961)
Keywords: language model, prompt
Abstract: This study is dedicated to exploring the application of prompt learning methods to advance Named Entity Recognition (NER) within the medical domain. In recent years, the emergence of large-scale models has driven significant progress in NER tasks, particularly with the introduction of the BioBERT language model, which has greatly enhanced NER capabilities in medical texts. Our research introduces the Prompt-bioMRC model, which integrates both hard template and soft prompt designs aimed at refining the precision and efficiency of medical entity recognition. Through extensive experimentation across diverse medical datasets, our findings consistently demonstrate that our approach surpasses traditional models. This enhancement not only validates the efficacy of our methodology but also highlights its potential to provide reliable technological support for applications like intelligent diagnosis systems. By leveraging advanced NER techniques, this study contributes to advancing automated medical data processing, facilitating more accurate medical information extraction, and supporting efficient healthcare decision-making processes.

Title: No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success

Authors: Lukas Rauch, Moritz Wirth, Denis Huseljic, Marek Herde, Bernhard Sick, Matthias Aßenmacher
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.01992
Pdf URL: https://arxiv.org/pdf/2506.01992
Copy Paste: [[2506.01992]] No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success(https://arxiv.org/abs/2506.01992)
Keywords: language model, llm
Abstract: The advent of large language models (LLMs) capable of producing general-purpose representations lets us revisit the practicality of deep active learning (AL): By leveraging frozen LLM embeddings, we can mitigate the computational costs of iteratively fine-tuning large backbones. This study establishes a benchmark and systematically investigates the influence of LLM embedding quality on query strategies in deep AL. We employ five top-performing models from the massive text embedding benchmark (MTEB) leaderboard and two baselines for ten diverse text classification tasks. Our findings reveal key insights: First, initializing the labeled pool using diversity-based sampling synergizes with high-quality embeddings, boosting performance in early AL iterations. Second, the choice of the optimal query strategy is sensitive to embedding quality. While the computationally inexpensive Margin sampling can achieve performance spikes on specific datasets, we find that strategies like Badge exhibit greater robustness across tasks. Importantly, their effectiveness is often enhanced when paired with higher-quality embeddings. Our results emphasize the need for context-specific evaluation of AL strategies, as performance heavily depends on embedding quality and the target task.
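The Margin sampling strategy named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' benchmark code: it assumes a classifier trained on frozen embeddings has already produced class probabilities for the unlabeled pool, and the names `margin_scores`, `margin_query`, and `pool_probs` are hypothetical.

```python
def margin_scores(probabilities):
    """Return the gap between the top-1 and top-2 class probability per example."""
    scores = []
    for probs in probabilities:
        top1, top2 = sorted(probs, reverse=True)[:2]
        scores.append(top1 - top2)
    return scores


def margin_query(probabilities, batch_size):
    """Select the indices of the examples with the smallest margin.

    A small margin means the classifier is torn between two classes,
    so labeling that example is expected to be informative.
    """
    scores = margin_scores(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:batch_size]


# Hypothetical class probabilities for four unlabeled examples (assumption).
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
]
selected = margin_query(pool_probs, batch_size=2)
print(selected)
```

Because the score only needs a sort over predicted probabilities, the strategy is cheap compared with gradient-based strategies such as Badge, which is consistent with the abstract's description of it as computationally inexpensive.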

Title: NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

Authors: Abhay Gupta, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02000
Pdf URL: https://arxiv.org/pdf/2506.02000
Copy Paste: [[2506.02000]] NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts(https://arxiv.org/abs/2506.02000)
Keywords: language model, llm
Abstract: Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate six state-of-the-art (SOTA) models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We observe consistent accuracy drops with increased hops and context length, even in frontier models, revealing that sheer scale does not guarantee robust reasoning. Our failure mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to stress-test multi-hop reasoning at scale.

Title: Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

Authors: Christopher Lee Lübbers
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02018
Pdf URL: https://arxiv.org/pdf/2506.02018
Copy Paste: [[2506.02018]] Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data(https://arxiv.org/abs/2506.02018)
Keywords: language model
Abstract: Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection (PTD) model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
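For readers unfamiliar with DPO, the per-pair objective as it is commonly stated can be sketched in a few lines. The abstract gives no training details, so this is an illustration of the general technique rather than the author's setup: the function name `dpo_loss`, the value of `beta`, and the example log-probabilities are all assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen_gain - rejected_gain)).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities (assumption, for illustration only).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy favors the human-preferred paraphrase
print(neutral, improved)
```

The loss falls as the policy raises the likelihood of the human-preferred paraphrase relative to the reference model while lowering that of the rejected one, which is how ranked human judgments enter training directly.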

Title: ChatCFD: an End-to-End CFD Agent with Domain-specific Structured Thinking

Authors: E Fan, Weizong Wang, Tianhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02019
Pdf URL: https://arxiv.org/pdf/2506.02019
Copy Paste: [[2506.02019]] ChatCFD: an End-to-End CFD Agent with Domain-specific Structured Thinking(https://arxiv.org/abs/2506.02019)
Keywords: language model, prompt, chat, agent
Abstract: Computational Fluid Dynamics (CFD) is essential for scientific and engineering advancements but is limited by operational complexity and the need for extensive expertise. This paper presents ChatCFD, a large language model-driven pipeline that automates CFD workflows within the OpenFOAM framework. It enables users to configure and execute complex simulations from natural language prompts or published literature with minimal expertise. The innovation is its structured approach to database construction, configuration validation, and error reflection, integrating CFD and OpenFOAM knowledge with general language models to improve accuracy and adaptability. Validation shows ChatCFD can autonomously reproduce published CFD results, handling complex, unseen configurations beyond basic examples, a task challenging for general language models.

Title: FinS-Pilot: A Benchmark for Online Financial System

Authors: Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02037
Pdf URL: https://arxiv.org/pdf/2506.02037
Copy Paste: [[2506.02037]] FinS-Pilot: A Benchmark for Online Financial System(https://arxiv.org/abs/2506.02037)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various professional domains, with their performance typically evaluated through standardized benchmarks. However, the development of financial RAG benchmarks has been constrained by data confidentiality issues and the lack of dynamic data integration. To address this issue, we introduce FinS-Pilot, a novel benchmark for evaluating RAG systems in online financial applications. Constructed from real-world financial assistant interactions, our benchmark incorporates both real-time API data and structured text sources, organized through an intent classification framework covering critical financial domains such as equity analysis and macroeconomic forecasting. The benchmark enables comprehensive evaluation of financial assistants' capabilities in handling both static knowledge and time-sensitive market information. Through systematic experiments with multiple leading Chinese LLMs, we demonstrate FinS-Pilot's effectiveness in identifying models suitable for financial applications while addressing the current gap in specialized evaluation tools for the financial domain. Our work contributes both a practical evaluation framework and a curated dataset to advance research in financial NLP systems. The code and dataset are accessible on GitHub (this https URL_rag_benchmark).

Title: Enhancing Multimodal Continual Instruction Tuning with BranchLoRA

Authors: Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, Jinfeng Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02041
Pdf URL: https://arxiv.org/pdf/2506.02041
Copy Paste: [[2506.02041]] Enhancing Multimodal Continual Instruction Tuning with BranchLoRA(https://arxiv.org/abs/2506.02041)
Keywords: language model, llm
Abstract: Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.

Title: Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?

Authors: Xiang Li, Jiayi Xin, Qi Long, Weijie J. Su
Subjects: cs.CL, cs.IR, cs.LG, stat.AP, stat.ME
Abstract URL: https://arxiv.org/abs/2506.02058
Pdf URL: https://arxiv.org/pdf/2506.02058
Copy Paste: [[2506.02058]] Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?(https://arxiv.org/abs/2506.02058)
Keywords: language model, llm
Abstract: Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often inconsistently reflect the actual capacities of these models. In this paper, we demonstrate that one of many contributing factors to this evaluation crisis is the oversight of unseen knowledge: information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information retrieval effectiveness, and measuring output diversity. Our experiments reveal that a substantial volume of knowledge is omitted when relying solely on observed LLM performance. Importantly, KnowSum yields significantly different comparative rankings for several common LLMs based on their internal knowledge.
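The idea of extrapolating unseen knowledge from appearance frequencies can be illustrated with two classical estimators, Good-Turing and Chao1. The abstract does not say which estimator KnowSum uses, so these are stand-ins chosen for illustration, and the function names and the sample of theorem names are hypothetical.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Estimate the probability that the next sample is a never-seen item.

    Good-Turing: (number of items seen exactly once) / (total samples).
    """
    counts = Counter(observations)
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


def chao1_unseen_count(observations):
    """Estimate how many distinct items exist but were never observed (Chao1)."""
    counts = Counter(observations)
    f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
    if f2 == 0:
        # Bias-corrected form used when no item was seen exactly twice.
        return f1 * (f1 - 1) / 2
    return f1 * f1 / (2 * f2)


# Hypothetical theorem names elicited from a model over repeated queries (assumption).
samples = ["Pythagoras", "Pythagoras", "Fermat", "Bayes", "Bayes", "Rolle", "Stokes"]
missing_mass = good_turing_missing_mass(samples)
unseen_items = chao1_unseen_count(samples)
print(missing_mass, unseen_items)
```

Both estimators rely on the same intuition: many items seen only once suggest that a large share of what the model knows has not surfaced yet.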

Title: Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

Authors: Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02126
Pdf URL: https://arxiv.org/pdf/2506.02126
Copy Paste: [[2506.02126]] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains(https://arxiv.org/abs/2506.02126)
Keywords: language model, llm
Abstract: Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; in the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.

Title: Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

Authors: Michael Li, Nishant Subramani
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02132
Pdf URL: https://arxiv.org/pdf/2506.02132
Copy Paste: [[2506.02132]] Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models(https://arxiv.org/abs/2506.02132)
Keywords: language model, gpt, llm
Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how both classical architectures (BERT, DeBERTa, GPT-2) and contemporary large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) represent lexical identity and inflectional morphology. We train linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. We discover that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout the layers. Further analysis reveals that these models encode inflectional morphology through generalizable abstractions, but rely predominantly on memorization to encode lexical identity. Remarkably, these patterns emerge across all 16 models we test, despite differences in architecture, size, and training regime (including pretrained and instruction-tuned variants). This consistency suggests that, despite substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties could be fundamental for next token prediction and are learned early during pretraining. Our code is available at this https URL.
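The linear probing setup mentioned in the abstract can be sketched with a tiny logistic-regression classifier trained on frozen activations. This is a toy illustration under stated assumptions, not the authors' code: the two-dimensional "activations", the binary singular/plural labels, and the names `train_linear_probe` and `probe_accuracy` are all hypothetical.

```python
import math


def train_linear_probe(activations, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with full-batch gradient descent."""
    dim = len(activations[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(activations)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, activations, labels):
    """Fraction of examples whose sign of the linear score matches the label."""
    correct = 0
    for x, y in zip(activations, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == bool(y))
    return correct / len(labels)


# Hypothetical layer activations; label 1 = plural, 0 = singular (assumption).
train_acts = [[2.0, 0.5], [1.5, 0.2], [1.8, 1.0], [0.3, 1.7], [0.1, 2.2], [0.9, 1.6]]
train_labels = [1, 1, 1, 0, 0, 0]
heldout_acts = [[1.6, 0.4], [0.4, 1.8]]
heldout_labels = [1, 0]

weights, bias = train_linear_probe(train_acts, train_labels)
heldout_accuracy = probe_accuracy(weights, bias, heldout_acts, heldout_labels)
print(heldout_accuracy)
```

In a real study the probe would be trained separately on each layer's activations, and high held-out accuracy for a linear probe is read as evidence that the feature is linearly separable at that layer.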

Title: BabyLM's First Constructions: Causal interventions provide a signal of learning

Authors: Joshua Rozner, Leonie Weissweiler, Cory Shain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02147
Pdf URL: https://arxiv.org/pdf/2506.02147
Copy Paste: [[2506.02147]] BabyLM's First Constructions: Causal interventions provide a signal of learning(https://arxiv.org/abs/2506.02147)
Keywords: language model
Abstract: Construction grammar posits that children acquire constructions (form-meaning pairings) from the statistics of their environment. Recent work supports this hypothesis by showing sensitivity to constructions in pretrained language models (PLMs), including one recent study (Rozner et al., 2025) demonstrating that constructions shape the PLM's output distribution. However, models under study have generally been trained on developmentally implausible amounts of data, casting doubt on their relevance to human language learning. Here we use Rozner et al.'s methods to evaluate constructional learning in models from the 2024 BabyLM challenge. Our results show that even when trained on developmentally plausible quantities of data, models represent diverse constructions, even hard cases that are superficially indistinguishable. We further find correlational evidence that constructional performance may be functionally relevant: models that better represent constructions perform better on the BabyLM benchmarks.

Title: BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models

Authors: Lindia Tjuatja, Graham Neubig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02204
Pdf URL: https://arxiv.org/pdf/2506.02204
Copy Paste: [[2506.02204]] BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models(https://arxiv.org/abs/2506.02204)
Keywords: language model, prompt
Abstract: Language model evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choice of benchmarks is endless. Finding examples that show meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose a methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. Specifically, BehaviorBox finds features that describe groups of words in fine-grained contexts, such as "conditional 'were' in the phrase 'if you were'" and "exclamation marks after emotional statements", where one model outperforms another within a particular dataset. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.

Title: Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics

Authors: Ella Rannon, David Burstein
Subjects: cs.CL, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2506.02212
Pdf URL: https://arxiv.org/pdf/2506.02212
Copy Paste: [[2506.02212]] Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics(https://arxiv.org/abs/2506.02212)
Keywords: language model
Abstract: Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.

Title: Investigating the Impact of Word Informativeness on Speech Emotion Recognition

Authors: Sofoklis Kakouros
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02239
Pdf URL: https://arxiv.org/pdf/2506.02239
Copy Paste: [[2506.02239]] Investigating the Impact of Word Informativeness on Speech Emotion Recognition(https://arxiv.org/abs/2506.02239)
Keywords: language model
Abstract: In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.

Title: CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment

Authors: Radin Shayanfar, Chu Fei Luo, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02264
Pdf URL: https://arxiv.org/pdf/2506.02264
Copy Paste: [[2506.02264]] CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment(https://arxiv.org/abs/2506.02264)
Keywords: llm
Abstract: It is often challenging to teach specialized, unseen tasks to dialogue systems due to the high cost of expert knowledge, training data, and high technical difficulty. To support domain-specific applications - such as law, medicine, or finance - it is essential to build frameworks that enable non-technical experts to define, test, and refine system behaviour with minimal effort. Achieving this requires cross-disciplinary collaboration between developers and domain specialists. In this work, we introduce a novel framework, CoDial (Code for Dialogue), that converts expert knowledge, represented as a novel structured heterogeneous graph, into executable conversation logic. CoDial can be easily implemented in existing guardrailing languages, such as Colang, to enable interpretable, modifiable, and true zero-shot specification of task-oriented dialogue systems. Empirically, CoDial achieves state-of-the-art performance on the STAR dataset for inference-based models and is competitive with similar baselines on the well-known MultiWOZ dataset. We also demonstrate CoDial's iterative improvement via manual and LLM-aided feedback, making it a practical tool for expert-guided alignment of LLMs in high-stakes domains.

Title: ImpRAG: Retrieval-Augmented Generation with Implicit Queries

Authors: Wenzheng Zhang, Xi Victoria Lin, Karl Stratos, Wen-tau Yih, Mingda Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02279
Pdf URL: https://arxiv.org/pdf/2506.02279
Copy Paste: [[2506.02279]] ImpRAG: Retrieval-Augmented Generation with Implicit Queries(https://arxiv.org/abs/2506.02279)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems traditionally treat retrieval and generation as separate processes, requiring explicit textual queries to connect them. This separation can limit the ability of models to generalize across diverse tasks. In this work, we propose a query-free RAG system, named ImpRAG, which integrates retrieval and generation into a unified model. ImpRAG allows models to implicitly express their information needs, eliminating the need for human-specified queries. By dividing pretrained decoder-only language models into specialized layer groups, ImpRAG optimizes retrieval and generation tasks simultaneously. Our approach employs a two-stage inference process, using the same model parameters and forward pass for both retrieval and generation, thereby minimizing the disparity between retrievers and language models. Experiments on 8 knowledge-intensive tasks demonstrate that ImpRAG achieves 3.6-11.5 improvements in exact match scores on unseen tasks with diverse formats, highlighting its effectiveness in enabling models to articulate their own information needs and generalize across tasks. Our analysis underscores the importance of balancing retrieval and generation parameters and leveraging generation perplexities as retrieval training objectives for enhanced performance.

Title: LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

Authors: Thai Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S Ryoo, Chien-Sheng Wu, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02298
Pdf URL: https://arxiv.org/pdf/2506.02298
Copy Paste: [[2506.02298]] LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback(https://arxiv.org/abs/2506.02298)
Keywords: language model, llm, agent
Abstract: Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting its efficiency and effectiveness in speeding up development of AI agents.

Title: Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments

Authors: Russell Scheinberg, Ameeta Agrawal, Amber Shore, So Young Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02302
Pdf URL: https://arxiv.org/pdf/2506.02302
Copy Paste: [[2506.02302]] Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments(https://arxiv.org/abs/2506.02302)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) can explain grammatical rules, yet they often fail to apply those rules when judging sentence acceptability. We present "grammar prompting", an explain-then-process paradigm: a large LLM first produces a concise explanation of the relevant syntactic phenomenon, then that explanation is fed back as additional context to the target model -- either an LLM or a smaller language model (SLM) -- before deciding which sentence of a minimal pair is grammatical. On the English BLiMP, Chinese SLING, and Russian RuBLiMP benchmarks, this simple prompt design yields substantial improvements over strong baselines across many syntactic phenomena. Feeding an LLM's metalinguistic explanation back to the target model bridges the gap between knowing a rule and using it. On SLMs, grammar prompting alone trims the average LLM-SLM accuracy gap by about 20%, and when paired with chain-of-thought, by 56% (13.0 pp -> 5.8 pp), all at negligible cost. The lightweight, language-agnostic cue lets low-cost SLMs approach frontier-LLM performance in multilingual settings.
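The explain-then-process flow described in this abstract can be sketched roughly as follows. This is only an illustration: the two prompt templates and the function `grammar_prompting` are hypothetical (the abstract does not give the paper's wording), and the explainer and target models are passed in as plain callables so the sketch runs without any model API.

```python
from typing import Callable

# Hypothetical templates; the paper's actual prompts are not given in the abstract.
EXPLAIN_TEMPLATE = "Briefly explain the grammar rule behind this phenomenon: {phenomenon}"
JUDGE_TEMPLATE = (
    "Grammar note: {explanation}\n"
    "Sentence A: {a}\n"
    "Sentence B: {b}\n"
    "Which sentence is grammatical? Answer A or B."
)


def grammar_prompting(
    phenomenon: str,
    sentence_a: str,
    sentence_b: str,
    explainer: Callable[[str], str],  # stands in for the large LLM
    target: Callable[[str], str],     # stands in for the target LLM or SLM
) -> str:
    """Two-step judgment on a minimal pair: explain first, then decide."""
    # Step 1: the large model produces a concise explanation of the phenomenon.
    explanation = explainer(EXPLAIN_TEMPLATE.format(phenomenon=phenomenon))
    # Step 2: the explanation is fed to the target model as additional context.
    prompt = JUDGE_TEMPLATE.format(explanation=explanation, a=sentence_a, b=sentence_b)
    answer = target(prompt).strip().upper()
    # Anything that does not start with "A" is treated as "B" (a simplification).
    return "A" if answer.startswith("A") else "B"
```

In practice the two callables would wrap real model calls; keeping them as parameters makes it easy to swap an LLM for an SLM as the target, which is the comparison the abstract reports.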

Title: Something Just Like TRuST : Toxicity Recognition of Span and Target

Authors: Berk Atil, Namrata Sureddy, Rebecca J. Passonneau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02326
Pdf URL: https://arxiv.org/pdf/2506.02326
Copy Paste: [[2506.02326]] Something Just Like TRuST : Toxicity Recognition of Span and Target(https://arxiv.org/abs/2506.02326)
Keywords: language model, llm, prompt
Abstract: Toxicity in online content, including content generated by language models, has become a critical concern due to its potential for negative psychological and social impact. This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection; it merges existing datasets and has labels for toxicity, target social group, and toxic spans. It includes a diverse range of target groups such as ethnicity, gender, religion, disability, and politics, with both human/machine-annotated and human/machine-generated data. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups. Further, reasoning capabilities do not significantly improve performance, indicating that LLMs have weak social reasoning skills.

Title: One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

Authors: Hyungjoo Chae, Dongjin Kang, Jihyuk Kim, Beong-woo Kwak, Sunghyun Park, Haeju Park, Jinyoung Yeo, Moontae Lee, Kyungjae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02338
Pdf URL: https://arxiv.org/pdf/2506.02338
Copy Paste: [[2506.02338]] One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL(https://arxiv.org/abs/2506.02338)
Keywords: language model, llm, chain-of-thought
Abstract: With the release of R1, a publicly available large reasoning model (LRM), researchers commonly train new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on the existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to--or slightly below--R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.

Title: Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection

Authors: Herun Wan, Jiaying Wu, Minnan Luo, Zhi Zeng, Zhixiong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02350
Pdf URL: https://arxiv.org/pdf/2506.02350
Copy Paste: [[2506.02350]] Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection(https://arxiv.org/abs/2506.02350)
Keywords: language model, llm, prompt
Abstract: Misinformation detection models often rely on superficial cues (i.e., shortcuts) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation through simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose SMF, an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, encouraging models to rely on deeper semantic understanding rather than shortcut cues. To promote the development of misinformation detectors, we have published the resources publicly at this https URL.

Title: DIAMOND: An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization

Authors: Jeonghun Kang, Soonmok Kwon, Joonseok Lee, Byung-Hak Kim
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02351
Pdf URL: https://arxiv.org/pdf/2506.02351
Copy Paste: [[2506.02351]] DIAMOND: An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization(https://arxiv.org/abs/2506.02351)
Keywords: llm, agent
Abstract: Traditional approaches -- such as Win Probability Added (WPA)-based ranking or computer vision-driven event detection -- can identify scoring plays but often miss strategic depth, momentum shifts, and storyline progression. Manual curation remains the gold standard but is resource-intensive and not scalable. We introduce DIAMOND, an LLM-driven agent for context-aware baseball highlight summarization that integrates structured sports analytics with natural language reasoning. DIAMOND leverages sabermetric features -- Win Expectancy, WPA, and Leverage Index -- to quantify play importance, while an LLM module enhances selection based on contextual narrative value. This hybrid approach ensures both quantitative rigor and qualitative richness, surpassing the limitations of purely statistical or vision-based systems. Evaluated on five diverse Korean Baseball Organization League games, DIAMOND improves F1-score from 42.9% (WPA-only) to 84.8%, outperforming both commercial and statistical baselines. Though limited in scale, our results highlight the potential of modular, interpretable agent-based frameworks for event-level summarization in sports and beyond.
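For readers unfamiliar with the WPA-only baseline mentioned in this abstract, here is a minimal sketch of ranking plays by Win Probability Added, taken as the change in one team's win expectancy across a play. The `Play` class, the function names, and all numbers are made up for illustration; this is not DIAMOND's pipeline, which additionally uses Leverage Index and an LLM module.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Play:
    description: str
    we_before: float  # win expectancy (0..1) for a fixed team before the play
    we_after: float   # win expectancy for the same team after the play


def wpa(play: Play) -> float:
    """Win Probability Added: change in win expectancy caused by the play."""
    return play.we_after - play.we_before


def top_plays_by_wpa(plays: List[Play], k: int) -> List[Play]:
    """Statistical baseline: keep the k plays with the largest swing in either direction."""
    return sorted(plays, key=lambda p: abs(wpa(p)), reverse=True)[:k]
```

A ranking like this surfaces large swings in win expectancy but has no notion of storyline or momentum, which is the gap the abstract says the LLM module is meant to cover.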

Title: AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output

Authors: Hisami Suzuki, Satoru Katsumata, Takashi Kodama, Tetsuro Takahashi, Kouta Nakayama, Satoshi Sekine
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02372
Pdf URL: https://arxiv.org/pdf/2506.02372
Copy Paste: [[2506.02372]] AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output(https://arxiv.org/abs/2506.02372)
Keywords: llm
Abstract: In this paper we present AnswerCarefully, a dataset for promoting the safety and appropriateness of Japanese LLM outputs. The dataset consists of 1,800 pairs of questions and reference answers, where the questions require special attention in answering. It covers a wide range of risk categories established in prior English-language datasets, but the data samples are original in that they are manually created to reflect the socio-cultural context of LLM usage in Japan. We show that using this dataset for instruction to fine-tune a Japanese LLM led to improved output safety without compromising the utility of general responses. We also report the results of a safety evaluation of 12 Japanese LLMs using this dataset as a benchmark. Finally, we describe the latest update on the dataset which provides English translations and annotations of the questions, aimed at facilitating the derivation of similar datasets in different languages and regions.

Title: Exploring Explanations Improves the Robustness of In-Context Learning

Authors: Ukyo Honda, Tatsushi Oka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02378
Pdf URL: https://arxiv.org/pdf/2506.02378
Copy Paste: [[2506.02378]] Exploring Explanations Improves the Robustness of In-Context Learning(https://arxiv.org/abs/2506.02378)
Keywords: language model, llm
Abstract: In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X$^2$-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X$^2$-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.

Title: Consultant Decoding: Yet Another Synergistic Mechanism

Authors: Chuanghao Ding, Jiaping Wang, Ziqing Yang, Xiaoliang Wang, Dahua Lin, Cam-Tu Nguyen, Fei Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02391
Pdf URL: https://arxiv.org/pdf/2506.02391
Copy Paste: [[2506.02391]] Consultant Decoding: Yet Another Synergistic Mechanism(https://arxiv.org/abs/2506.02391)
Keywords: language model, llm
Abstract: The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates require repeated LLM calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergetic mechanism, Consultant Decoding (CD). Unlike SD, which relies on a metric derived from importance sampling for verification, CD verifies candidate drafts using token-level likelihoods computed solely by the LLM. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (around 100% of the target model's performance). Interestingly, this is achieved by combining models whose parameter sizes differ by two orders of magnitude. In addition, CD reduces the call frequency of the large target model to below 10%, particularly in more demanding tasks. CD's performance was even found to surpass that of the large target model, which theoretically represents the upper bound for speculative decoding.
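To make the contrast between the two verification styles concrete, the sketch below places the standard speculative-sampling acceptance test next to a likelihood-only check. The first rule (accept a draft token with probability min(1, p_target / p_draft)) is the usual SD test. The second is only an assumed stand-in for "token-level likelihoods computed solely by the LLM": the abstract does not state CD's actual criterion, so the fixed `threshold` and the function names are hypothetical.

```python
from typing import Sequence


def sd_accept(p_target: float, p_draft: float, u: float) -> bool:
    """Standard speculative sampling: accept with probability min(1, p_target / p_draft).

    `u` is a uniform random draw in [0, 1), passed in so the check is reproducible.
    """
    return u < min(1.0, p_target / p_draft)


def cd_accept(p_target: float, threshold: float = 0.1) -> bool:
    """Assumed likelihood-only rule: accept if the target model finds the token plausible."""
    return p_target >= threshold


def accepted_prefix_length(target_probs: Sequence[float], threshold: float = 0.1) -> int:
    """Number of leading draft tokens accepted before the first rejection."""
    n = 0
    for p in target_probs:
        if not cd_accept(p, threshold):
            break
        n += 1
    return n
```

Under the ratio test, a token the target model considers reasonable can still be rejected often when the draft model was much more confident about it. A check that looks only at the target likelihood does not penalize that mismatch, which is one plausible reading of how rejection rates could drop.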

Title: GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation

Authors: Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qianwen Zhang, Di Yin, Xing Sun, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02404
Pdf URL: https://arxiv.org/pdf/2506.02404
Copy Paste: [[2506.02404]] GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation(https://arxiv.org/abs/2506.02404)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key advantages: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in 20 core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.

Title: Gender Inequality in English Textbooks Around the World: an NLP Approach

Authors: Tairan Liu
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2506.02425
Pdf URL: https://arxiv.org/pdf/2506.02425
Copy Paste: [[2506.02425]] Gender Inequality in English Textbooks Around the World: an NLP Approach(https://arxiv.org/abs/2506.02425)
Keywords: language model
Abstract: Textbooks play a critical role in shaping children's understanding of the world. While previous studies have identified gender inequality in individual countries' textbooks, few have examined the issue cross-culturally. This study applies natural language processing methods to quantify gender inequality in English textbooks from 22 countries across 7 cultural spheres. Metrics include character count, firstness (which gender is mentioned first), and TF-IDF word associations by gender. The analysis also identifies gender patterns in proper names appearing in TF-IDF word lists, tests whether large language models can distinguish between gendered word lists, and uses GloVe embeddings to examine how closely keywords associate with each gender. Results show consistent overrepresentation of male characters in terms of count, firstness, and named entities. All regions exhibit gender inequality, with the Latin cultural sphere showing the least disparity.
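As a rough illustration of the firstness metric named in this abstract, the sketch below counts, for sentences that mention both genders, which one appears first. The two word lists and the restriction to sentences containing both genders are assumptions made for the example; the abstract does not spell out the study's exact operationalization.

```python
import re
from typing import Dict, Iterable, List

# Small illustrative word lists (assumed, not the study's lexicon).
MALE = {"he", "him", "his", "boy", "man", "father", "brother"}
FEMALE = {"she", "her", "hers", "girl", "woman", "mother", "sister"}


def tokenize(sentence: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", sentence.lower())


def firstness_counts(sentences: Iterable[str]) -> Dict[str, int]:
    """For sentences mentioning both genders, count which gender is mentioned first."""
    counts = {"male": 0, "female": 0}
    for sentence in sentences:
        tokens = tokenize(sentence)
        token_set = set(tokens)
        if not (token_set & MALE and token_set & FEMALE):
            continue  # only sentences where both genders co-occur are scored
        for token in tokens:
            if token in MALE:
                counts["male"] += 1
                break
            if token in FEMALE:
                counts["female"] += 1
                break
    return counts
```

A male-to-female ratio computed from these counts is one simple way to compare textbooks across countries, though a real analysis would need a far larger lexicon and handling of proper names.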

Title: Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

Authors: Maryam Berijanian, Kuldeep Singh, Amin Sehati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02426
Pdf URL: https://arxiv.org/pdf/2506.02426
Copy Paste: [[2506.02426]] Comparative Analysis of AI Agent Architectures for Entity Relationship Classification(https://arxiv.org/abs/2506.02426)
Keywords: language model, llm, prompt, agent
Abstract: Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source codes and dataset are available at this https URL.

Title: From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models

Authors: Mahammed Kamruzzaman, Abdullah Al Monsur, Gene Louis Kim, Anshuman Chhabra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02431
Pdf URL: https://arxiv.org/pdf/2506.02431
Copy Paste: [[2506.02431]] From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models(https://arxiv.org/abs/2506.02431)
Keywords: language model, llm, agent
Abstract: Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.

Title: Should LLM Safety Be More Than Refusing Harmful Instructions?

Authors: Utsav Maskey, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02442
Pdf URL: https://arxiv.org/pdf/2506.02442
Copy Paste: [[2506.02442]] Should LLM Safety Be More Than Refusing Harmful Instructions?(https://arxiv.org/abs/2506.02442)
Keywords: language model, llm
Abstract: This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal: the ability to reject harmful obfuscated instructions, and (2) generation safety: the suppression of harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLMs in long-tail text scenarios and provides directions for developing robust safety mechanisms.

Title: Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Authors: Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Minfeng Zhu, Bo Zhang, Wei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02454
Pdf URL: https://arxiv.org/pdf/2506.02454
Copy Paste: [[2506.02454]] Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework(https://arxiv.org/abs/2506.02454)
Keywords: language model, llm, retrieval augmented generation, agent
Abstract: Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics serving as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.

Title: MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework

Authors: Yupeng Qi, Ziyu Lyu, Min Yang, Yanlin Wang, Lu Bai, Lixin Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02460
Pdf URL: https://arxiv.org/pdf/2506.02460
Copy Paste: [[2506.02460]] MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework(https://arxiv.org/abs/2506.02460)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies solve this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which might reduce helpfulness, while the safety-constrained offline methods perform poorly in adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a Mixture of Experts (MoE) framework for safety-helpfulness dual Preference Optimization. Firstly, MidPO devises a single-preference enhanced direct preference optimization approach to transform the base model into two independent experts, termed safety and helpfulness experts, and fine-tunes the two independent experts for optimal safety or helpfulness performance. Secondly, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism to allocate contributions from each expert adaptively. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate that the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.
摘要：随着大型语言模型（LLM）越来越多地在各个领域中应用，同时增强安全性的同时保持了LLM的帮助已成为一个关键的挑战。最近的研究通过安全受限的在线偏好优化或安全受限的离线偏好优化解决了这一问题。但是，安全受限的在线方法通常会遭受过度安全性，这可能会降低帮助性，而安全受限的离线方法在适应性平衡的安全性和帮助方面的表现较差。为了解决这些局限性，我们提出了MIDPO，A \ TextBf {\下划线{Mi}} Xture（MOE）安全性\ textbf {\ Textbf {\ TextBf {\ textbf {\ pastionline {d}} ual \ textBf {首先，Midpo设计了单个挑战增强的直接偏好优化方法，将基本模型转变为两个独立专家，称为安全性和乐于助人的专家，并进行了两位独立专家，以实现最佳安全性或有益的性能。其次，为了在安全性和帮助之间达到有效的平衡，MIDPO将两位专家纳入MOE框架中，并设计了动态的路由机制，以适应每个专家的贡献。我们在三个流行的数据集上进行定量和定性实验，以证明所提出的MIDPO在安全性和有益性方面都显着优于最先进的方法。代码和模型将发布。
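The abstract describes a dynamic routing mechanism that allocates contributions from a safety expert and a helpfulness expert. The sketch below shows only the general idea of softmax-weighted mixing of two expert outputs; the function names, the router scores, and the use of plain lists are assumptions for illustration, not MidPO's actual router.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(safety_out, helpful_out, router_scores):
    """Blend two expert output vectors with softmax router weights.

    router_scores is a hypothetical pair (safety_score, helpfulness_score)
    that a learned router would produce for the current input.
    """
    w_safe, w_help = softmax(router_scores)
    return [w_safe * s + w_help * h for s, h in zip(safety_out, helpful_out)]

# Equal router scores give each expert half of the contribution.
mixed = mix_experts([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

In a real MoE layer the router scores would depend on the hidden state of each token, so the balance between the two experts could shift from one input to the next.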

Title: XToM: Exploring the Multilingual Theory of Mind for Large Language Models

Authors: Chunkit Chan, Yauwai Yim, Hongchuan Zeng, Zhiying Zou, Xinyuan Cheng, Zhifan Sun, Zheye Deng, Kawai Chung, Yuzhuo Ao, Yixiang Fan, Cheng Jiayang, Ercong Nie, Ginny Y. Wong, Helmut Schmid, Hinrich Schütze, Simon See, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02461
Pdf URL: https://arxiv.org/pdf/2506.02461
Copy Paste: [[2506.02461]] XToM: Exploring the Multilingual Theory of Mind for Large Language Models(https://arxiv.org/abs/2506.02461)
Keywords: language model, llm
Abstract: Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
摘要：思想理论（汤姆）是在他人中推断精神状态的能力，对于人类的社会认知至关重要。 LLM中TOM的现有评估在很大程度上仅限于英语，忽略了塑造人类认知的语言多样性。这个限制提出了一个关键的问题：LLM可以表现出多语言的心理理论，哪种能力是在各种语言环境中推理精神状态的能力？为了解决这一差距，我们提出了XTOM，这是一种严格验证的多语言基准，可评估五种语言的TOM，并结合了多样化，上下文丰富的任务方案。使用XTOM，我们系统地评估LLM（例如DeepSeek R1），揭示出明显的不和谐：虽然模型在多语言语言理解中表现出色，但它们的TOM性能在语言中会有所不同。我们的发现暴露了LLMS在语言环境中复制类似人类心理的能力中的局限性。

Title: FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging

Authors: Zijian Li, Xiaocheng Feng, Huixin Liu, Yichong Huang, Ting Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02478
Pdf URL: https://arxiv.org/pdf/2506.02478
Copy Paste: [[2506.02478]] FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging(https://arxiv.org/abs/2506.02478)
Keywords: language model
Abstract: With the development of large language models, fine-tuning has emerged as an effective method to enhance performance in specific scenarios by injecting domain-specific knowledge. In this context, model merging techniques provide a solution for fusing knowledge from multiple fine-tuned models by combining their parameters. However, traditional methods often encounter task interference when merging fully fine-tuned models, and this problem becomes even more evident in parameter-efficient fine-tuning scenarios. In this paper, we introduce an improvement to the RegMean method, which indirectly leverages the training data to approximate the outputs of the linear layers before and after merging. We propose an adaptive merging method called FroM, which directly measures the model parameters using the Frobenius norm, without any training data. By introducing an additional hyperparameter for control, FroM outperforms baseline methods across various fine-tuning scenarios, alleviating the task interference problem.
摘要：随着大型语言模型的发展，通过注入特定领域的知识来增强特定情况下的性能，成为一种有效的方法。在这种情况下，模型合并技术为通过组合参数从多个微调模型中融合知识提供了解决方案。但是，在合并完整的微调模型时，传统方法通常会遇到任务干扰，并且在参数有效的微调方案中，此问题变得更加明显。在本文中，我们引入了对雷德曼方法的改进，该方法间接利用训练数据来近似合并前后线性层的输出。我们提出了一种自适应合并方法，该方法直接使用Frobenius Norm直接测量模型参数，而无需任何培训数据。通过引入一个额外的高参数以进行控制，从各种微调方案的胜过基线方法中，减轻了任务干扰问题。
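The abstract does not give FroM's merging formula or say how its hyperparameter enters. As a rough illustration of data-free, norm-based merging in general, the sketch below weights each model's parameter delta (fine-tuned minus base weights) by its share of the total Frobenius norm. The weighting rule and all names here are assumptions, not the paper's method.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def merge_by_norm(deltas):
    """Merge same-shaped parameter deltas, weighting each by its norm share.

    Assumed rule for illustration only: weight_i = ||delta_i||_F / sum_j ||delta_j||_F.
    Falls back to uniform weights when every delta is zero.
    """
    norms = [frobenius_norm(d) for d in deltas]
    total = sum(norms)
    if total == 0:
        weights = [1.0 / len(deltas)] * len(deltas)
    else:
        weights = [n / total for n in norms]
    rows, cols = len(deltas[0]), len(deltas[0][0])
    merged = [
        [sum(w * d[i][j] for w, d in zip(weights, deltas)) for j in range(cols)]
        for i in range(rows)
    ]
    return merged, weights

# One delta with norm 5 and one all-zero delta: the first gets all the weight.
merged, weights = merge_by_norm([[[3.0, 4.0]], [[0.0, 0.0]]])
```

No training data is touched at any point, which is the property the abstract emphasizes; only the parameters themselves are measured.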

Title: ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities

Authors: Yifan Duan, Yihong Tang, Kehai Chen, Liqiang Nie, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02480
Pdf URL: https://arxiv.org/pdf/2506.02480
Copy Paste: [[2506.02480]] ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities(https://arxiv.org/abs/2506.02480)
Keywords: language model, llm, prompt
Abstract: High-quality prompts are crucial for eliciting outstanding performance from large language models (LLMs) on complex tasks. Existing research has explored model-driven strategies for prompt optimization. However, these methods often suffer from high computational overhead or require strong optimization capabilities from the model itself, which limits their broad applicability. To address these challenges, we propose ORPP (Optimized Role-Playing Prompt), a framework that enhances model performance by optimizing and generating role-playing prompts. The core idea of ORPP is to confine the prompt search space to role-playing scenarios, thereby fully activating the model's intrinsic capabilities through carefully crafted, high-quality role-playing prompts. Specifically, ORPP first performs iterative optimization on a small subset of training samples to generate high-quality role-playing prompts. Then, leveraging the model's few-shot learning capability, it transfers the optimization experience to efficiently generate suitable prompts for the remaining samples. Experimental results show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance. Notably, ORPP demonstrates superior "plug-and-play" capability. In most cases, it can be integrated with various other prompt methods and further enhance their effectiveness.
摘要：高质量的提示对于从复杂任务中引起大语言模型（LLM）的出色表现至关重要。现有研究探索了模型驱动的策略以迅速优化。但是，这些方法通常会遭受高计算间接费用或需要从模型本身中强大的优化功能，从而限制了其广泛的HTTP URL解决这些挑战，我们提出了ORPP（优化的角色扮演提示），该框架通过优化和生成角色扮演提示来增强模型性能。 ORPP的核心思想是将及时的搜索空间限制在角色扮演的情况下，从而通过精心制作的高质量的角色扮演提示来充分激活模型的内在功能。具体而言，ORPP首先在一小部分训练样本上执行迭代优化，以生成高质量的角色扮演提示。然后，利用该模型的少量学习能力，它转移优化体验，以有效地为剩余的HTTP URL实验结果产生合适的提示，表明ORPP不仅匹配，而且在大多数情况下，就可以在性能方面超过现有的主流及时迅速优化方法。值得注意的是，Orpp展示了出色的“插入式”功能。在大多数情况下，它可以与其他各种迅速方法集成，并进一步提高其有效性。

Title: Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths

Authors: Inderjeet Nair, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02481
Pdf URL: https://arxiv.org/pdf/2506.02481
Copy Paste: [[2506.02481]] Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths(https://arxiv.org/abs/2506.02481)
Keywords: language model, llm
Abstract: Evaluations of LLMs' ethical risks and value inclinations often rely on short-form surveys and psychometric tests, yet real-world use involves long-form, open-ended responses -- leaving value-related risks and preferences in practical settings largely underexplored. In this work, we ask: Do value preferences inferred from short-form tests align with those expressed in long-form outputs? To address this question, we compare value preferences elicited from short-form reactions and long-form responses, varying the number of arguments in the latter to capture users' differing verbosity preferences. Analyzing five LLMs (llama3-8b, gemma2-9b, mistral-7b, qwen2-7b, and olmo-7b), we find (1) a weak correlation between value preferences inferred from short-form and long-form responses across varying argument counts, and (2) similarly weak correlation between preferences derived from any two distinct long-form generation settings. (3) Alignment yields only modest gains in the consistency of value expression. Further, we examine how long-form generation attributes relate to value preferences, finding that argument specificity negatively correlates with preference strength, while representation across scenarios shows a positive correlation. Our findings underscore the need for more robust methods to ensure consistent value expression across diverse applications.
摘要：对LLMS的道德风险和价值倾向的评估通常依赖于短形式的调查和心理测试测试，但是现实世界中的使用涉及长期，开放式的响应 - 在实践环境中留下与价值相关的风险和偏好，在很大程度上尚未充满兴奋。在这项工作中，我们问：从短形式测试与长效输出中表达的相一致的价值偏好是否相符？为了解决这个问题，我们比较了从短形式反应和长期响应中引起的价值偏好，从而改变了后者中参数的数量，以捕获用户的不同词曲偏好。分析五个LLM（LLAMA3-8B，GEMMA2-9B，MISTRAL-7B，QWEN2-7B和OLMO-7B），我们发现（1）从短形式和（2）偏好之间的较弱型号的较弱的较差的较差的较差的较差的相关性中，与任何两者之间的差异相似，从短形式和长形响应中推断出的价值偏好之间的相关性较弱，并且从两种差异的范围中得出了两种差异。（3）对准仅在价值表达的一致性中获得适度的增长。此外，我们检查了长期产生属性与价值偏好相关的多长时间，发现参数特异性与偏好强度有负相关，而各场景的表示形式显示出正相关。我们的发现强调了需要更强大的方法，以确保各种应用程序之间的价值表达一致。

Title: Enhancing Large Language Models with Neurosymbolic Reasoning for Multilingual Tasks

Authors: Sina Bagheri Nezhad, Ameeta Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02483
Pdf URL: https://arxiv.org/pdf/2506.02483
Copy Paste: [[2506.02483]] Enhancing Large Language Models with Neurosymbolic Reasoning for Multilingual Tasks(https://arxiv.org/abs/2506.02483)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) often struggle to perform multi-target reasoning in long-context scenarios where relevant information is scattered across extensive documents. To address this challenge, we introduce NeuroSymbolic Augmented Reasoning (NSAR), which combines the benefits of neural and symbolic reasoning during inference. NSAR explicitly extracts symbolic facts from text and generates executable Python code to handle complex reasoning steps. Through extensive experiments across seven languages and diverse context lengths, we demonstrate that NSAR significantly outperforms both a vanilla RAG baseline and advanced prompting strategies in accurately identifying and synthesizing multiple pieces of information. Our results highlight the effectiveness of combining explicit symbolic operations with neural inference for robust, interpretable, and scalable reasoning in multilingual settings.
摘要：大型语言模型（LLMS）通常在长篇小说方案中进行多目标推理，在这些方案中相关信息散布在广泛的文档中。为了应对这一挑战，我们引入了神经肯定增强推理（NSAR），该推理结合了推理过程中神经和象征性推理的益处。 NSAR从文本中明确提取符号事实，并生成可执行的Python代码以处理复杂的推理步骤。通过跨七种语言和不同上下文长度的大量实验，我们证明了NSAR的表现明显优于香草抹布基线和高级提示策略，以准确识别和综合多个信息。我们的结果突出了将显式符号操作与神经推断相结合的有效性，以在多语种环境中进行鲁棒，可解释和可扩展的推理。

Title: Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

Authors: Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, Xiaojun Wan
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02494
Pdf URL: https://arxiv.org/pdf/2506.02494
Copy Paste: [[2506.02494]] Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text(https://arxiv.org/abs/2506.02494)
Keywords: gpt, llm
Abstract: Evaluation is important for multimodal generation tasks. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing work overlooks two aspects: (1) the development of evaluation capabilities for the text-to-image (T2I) generation task, and (2) the incorporation of large-scale human evaluation data. In this paper, we introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both humans and GPT. The corpus contains evaluation data across both image-to-text (I2T) and T2I generation tasks. Based on this corpus, we propose Data Selection and Balance, Mix-SFT training methods, and apply DPO to develop Minos, a multimodal evaluation model built upon a 7B backbone. Minos achieves state-of-the-art (SoTA) average evaluation performance across all tasks among open-source evaluation models of similar scale, and outperforms all open-source and closed-source models on evaluation of the T2I generation task. Extensive experiments demonstrate the importance of leveraging high-quality human evaluation data and jointly training on evaluation data from both I2T and T2I generation tasks.
摘要：评估对于多模式生成任务很重要。随着MLLM的快速发展，对应用MLLM建立一般评估系统的兴趣越来越大。但是，现有工作忽略了两个方面：（1）文本对图像（T2I）生成任务的评估功能的发展，以及（2）合并大型人类评估数据。在本文中，我们介绍了Minos-Corpus，这是一个大规模的多模式评估数据集，结合了来自人类和GPT的评估数据。该语料库包含图像到文本（I2T）和T2I生成任务的评估数据。基于此语料库，我们提出了数据选择和平衡，混合SFT训练方法，并应用DPO开发MINOS，这是建立在7B骨架上的多模式评估模型。 MINOS在所有任务上的平均评估绩效的所有开源评估模型中都实现了最新的（SOTA）性能，并且在评估T2I生成任务上，所有任务的评估绩效平均表现都优于所有开源和封闭式模型。广泛的实验表明，利用高质量的人类评估数据以及对I2T和T2I生成任务的评估数据的共同培训的重要性。

Title: KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG

Authors: Yongjian Li, HaoCheng Chu, Yukun Yan, Zhenghao Liu, Shi Yu, Zheni Zeng, Ruobing Wang, Sen Song, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02503
Pdf URL: https://arxiv.org/pdf/2506.02503
Copy Paste: [[2506.02503]] KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG(https://arxiv.org/abs/2506.02503)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents, even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO), a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on GitHub.
摘要：检索增强的生成（RAG）使大型语言模型（LLMS）能够访问更广泛的知识来源，但由于在检索文档中的噪音，即使使用先进的检索方法，事实上的不一致之处仍然存在。我们证明，增强生成模型处理嘈杂内容的能力对于稳健的性能同样至关重要。在本文中，我们介绍了kare-rag（知识意识的完善和增强RAG），通过三个关键创新来改善知识的利用：（1）结构化知识表示，促进培训期间促进错误检测的结构性知识表征，（2）密集的直接偏好优化优化（DDPO）（DDPO） - 精制训练目标，以确保对临界错误和（3）对比的纠正（3）对比的纠正（3），并且（3）对形成了对比的情况（3）。不准确。实验表明，我们的方法显着增强了跨模型量表的标准抹布管道，从而在不损害一般能力的情况下改善了内域和室外任务性能。值得注意的是，这些收益是通过适度的培训数据实现的，这表明通过有针对性的学习策略可以实现数据有效的优化。我们的发现为抹布改进建立了一个新的方向：通过改进模型学习处理检索内容的方式，我们可以提高各种推理范式的性能。所有数据和代码将在GitHub上公开可用。

Title: M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Authors: Jie Zhu, Junhui Li, Yalong Wen, Xiandong Li, Lifan Guo, Feng Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02510
Pdf URL: https://arxiv.org/pdf/2506.02510
Copy Paste: [[2506.02510]] M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset(https://arxiv.org/abs/2506.02510)
Keywords: language model, llm
Abstract: Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called M$^3$FinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M$^3$FinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M$^3$FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M$^3$FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.
摘要：大型语言模型（LLM）的最新突破已导致开发新的基准，以评估其在金融领域的绩效。但是，当前的财务基准通常依赖新闻文章，收益报告或公告，这使得捕捉财务会议的现实世界动态变得具有挑战性。为了解决这一差距，我们提出了一个新颖的基准标准，称为$ \ texttt {m $^3 $ finmeeting} $，这是一个多语言，多部门和多任务数据集，设计用于财务会议的理解。首先，$ \ texttt {m $^3 $ finmeeting} $支持英语，中文和日语，增强了对各种语言背景下的财务讨论的理解。其次，它涵盖了由全球行业分类标准（GIC）定义的各个行业领域，以确保基准跨越广泛的财务活动。最后，$ \ texttt {m $^3 $ finmeeting} $包含三个任务：摘要，问题 - 答案（QA）对提取和问题答案，并促进了对理解的更现实，更全面的评估。七个流行的LLMS的实验结果表明，即使是最先进的长篇小说模型也具有重大改进的空间，证明了$ \ texttt {m $^3 $ finmeeting} $的有效性，作为评估LLMS金融会议理解技能的基准。

Title: FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Authors: Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02515
Pdf URL: https://arxiv.org/pdf/2506.02515
Copy Paste: [[2506.02515]] FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning(https://arxiv.org/abs/2506.02515)
Keywords: llm, chain-of-thought
Abstract: Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
摘要：多步象征性推理对于在财务任务上的下游绩效至关重要。然而，缺乏用于系统评估此能力的基准。诸如FinQA和Convinqa之类的现有数据集仅监督最终的数值答案，而无需评估中间推理步骤。为了解决这个问题，我们介绍了Finchain，这是第一个符合可验证思想链（COT）财务推理的符号基准。 Fin Chain跨越了12个金融领域的54个主题，每个主题提供了5个参数化模板，每个模板都在推理复杂性和所需的域专业知识上有所不同。每个数据集实例都包括可执行的Python跟踪，可以自动生成广泛的培训数据，并轻松适应其他域。我们还引入了ChaineVal，这是一种新的指标，用于自动评估最终答案和中间推理。通过在我们的数据集上进行30个LLM的基准测试，我们发现即使是最先进的模型也有相当大的改进多步财务推理的空间。所有模板和雀链的评估指标均可在https：//github.com/mbzuai-nlp/finchain上获得。
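To make the idea of a parameterized template with an executable trace concrete, here is a minimal sketch. The compound-interest topic, the parameter ranges, and the dictionary layout are all hypothetical and are not taken from the FinChain repository; the point is only that sampling parameters yields a fresh question together with checkable intermediate steps.

```python
import random

def compound_interest_instance(rng):
    """Sample one question plus a step-by-step trace from a toy template."""
    principal = rng.choice([1000, 2000, 5000])
    rate = rng.choice([0.03, 0.05])
    years = rng.randint(1, 3)
    question = (
        f"An account holds {principal} and earns {rate:.0%} interest, "
        f"compounded annually. What is the balance after {years} year(s)?"
    )
    steps = []
    balance = float(principal)
    for year in range(1, years + 1):
        balance = balance * (1 + rate)  # one compounding step
        steps.append((f"balance after year {year}", round(balance, 2)))
    return {"question": question, "steps": steps, "answer": round(balance, 2)}

# A seeded generator makes the sampled instance reproducible.
instance = compound_interest_instance(random.Random(0))
```

Because every intermediate value is recorded, an evaluator can compare a model's reasoning chain against `steps` rather than checking only the final `answer`.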

Title: Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning

Authors: Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, Balaji Krishnamurthy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02519
Pdf URL: https://arxiv.org/pdf/2506.02519
Copy Paste: [[2506.02519]] Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning(https://arxiv.org/abs/2506.02519)
Keywords: gpt, llm, prompt
Abstract: LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improve the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of the ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (this https URL).
摘要：llmssuch作为GPT-4，通过产生逐步的理由来解决复杂问题的显着能力。先前的工作利用了这种能力来改善较小和便宜的LMS（例如，具有7B参数）。但是，由于在大型（经常关闭）模型的预训练数据中缺乏透明度，因此可以阻止其在商业环境中的使用，例如版权和法律问题，例如版权和法律问题。在不从较大的LLM中提取信息的情况下，很少有人重点提高较小模型的先天推理能力。为了解决这个问题，我们提出了整理的建议，这是一个可训练的框架，可调整A（小）LLM，以从有选择地改善下游任务的各种理由池中产生这些输出。整理强制执行相同LLM的多个实例以表现出不同的行为，并采用它们来产生理由以获得各种产出。然后，通过优先优化对LLM进行调整，以选择候选理由，从而最大程度地提高了基本真相答案的可能性。整理在3个域上的5个数据集上胜过几个可训练的基线：数学解决问题，自然语言推断和常识性推理。我们在不同模型家族（1b至8b）上从不同模型家族的LLM上进行了分解的效果，并证明了通过消融最终任务指导的多个理性提供者的好处。代码在此处发布（此HTTPS URL）。
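The selection step described above, choosing the rationale that maximizes the likelihood of the ground-truth answer, can be sketched as building a preference pair for later preference optimization. The scoring callback and the choice of the lowest-scoring rationale as the rejected example are assumptions made for illustration; the abstract does not say how COLLATE forms its pairs.

```python
def build_preference_pair(rationales, answer_log_likelihood):
    """Pick chosen/rejected rationales by ground-truth answer likelihood.

    answer_log_likelihood is a hypothetical callable mapping a rationale to
    log p(ground-truth answer | question, rationale) under the model.
    """
    if len(rationales) < 2:
        raise ValueError("need at least two candidate rationales")
    ranked = sorted(rationales, key=answer_log_likelihood)
    return {"chosen": ranked[-1], "rejected": ranked[0]}

# Toy scores standing in for real model log-likelihoods.
toy_scores = {"rationale A": -2.3, "rationale B": -0.4, "rationale C": -1.1}
pair = build_preference_pair(list(toy_scores), toy_scores.get)
```

The resulting pair could then be fed to any preference-optimization trainer; which one COLLATE uses is not stated in the abstract.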

Title: Answer Convergence as a Signal for Early Stopping in Reasoning

Authors: Xin Liu, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02536
Pdf URL: https://arxiv.org/pdf/2506.02536
Copy Paste: [[2506.02536]] Answer Convergence as a Signal for Early Stopping in Reasoning(https://arxiv.org/abs/2506.02536)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, thus increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we start with a systematic study to examine the minimum reasoning required for a model to reach a stable decision. We find that on math reasoning tasks such as MATH, models typically converge to their final answers after 60% of the reasoning steps, suggesting substantial redundancy in the remaining content. Based on these insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weight LLMs show that our methods significantly reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40% while further improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.
摘要：经过思考链（COT）提示在大语言模型（LLMS）中增强了推理，但通常会导致冗长和冗余的产出，从而增加推理成本。我们假设许多推理步骤对于产生正确的答案是不必要的。为了调查这一点，我们从系统的研究开始，以检查模型实现稳定决定所需的最低推理。我们发现，在数学推理任务中，模型通常会在60 \％的推理步骤之后收敛到最终答案，这表明其余内容中有很大的冗余。基于这些见解，我们提出了提高效率的三种推理时间策略：（1）通过答案一致性提早停止，（2）提高产生反应信号的可能性，以及（3）一种有监督的方法，该方法学会了何时基于内部激活来停止。五个基准和五个开放量LLMS的实验表明，我们的方法大大降低了令牌使用情况，几乎没有准确性下降。特别是，在自然要求上，答案一致性将令牌降低了40 \％，同时进一步提高了准确性。我们的工作强调了在推理时运行具有成本效益的推理方法的重要性，从而为现实世界应用提供了实际收益。
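Strategy (1), early stopping via answer consistency, can be illustrated with a small sketch: probe the model for a provisional answer after each reasoning step and stop once the same answer has appeared several times in a row. The `patience` threshold and the list-of-answers interface are assumptions for illustration; the abstract does not specify how the authors define consistency.

```python
def stop_step(intermediate_answers, patience=3):
    """Return the 1-based step at which to stop, or None if never stable.

    intermediate_answers holds the provisional answer extracted after each
    reasoning step (None when no answer could be extracted at that step).
    """
    streak = 0
    previous = None
    for step, answer in enumerate(intermediate_answers, start=1):
        if answer is not None and answer == previous:
            streak += 1
        else:
            # A new answer starts a fresh streak; a missing one resets it.
            streak = 1 if answer is not None else 0
        previous = answer
        if streak >= patience:
            return step
    return None

# The answer "15" repeats for the third consecutive time at step 4.
early = stop_step(["12", "15", "15", "15", "15"], patience=3)
```

A larger `patience` trades token savings for more confidence that the answer has actually stabilized.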

Title: CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

Authors: Yang Tian, Fan Liu, Jingyuan Zhang, Victoria W., Yupeng Hu, Liqiang Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02544
Pdf URL: https://arxiv.org/pdf/2506.02544
Copy Paste: [[2506.02544]] CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG(https://arxiv.org/abs/2506.02544)
Keywords: language model, retrieval-augmented generation
Abstract: Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge Reconciliation for MultiModal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6% and 9.3% performance gains on InfoSeek and Encyclopedic-VQA, respectively. We release code and data at this https URL.
摘要：已经引入了多模式检索增强的生成（MMRAG），以通过合并外部检索的多模式知识来增强多模式大语模型，但它引入了两个挑战：参数退还的知识不一致（PRKI）（PRKI），在该知识之间创造不确定的知识之间的差异，并在确定知识之间差异，并在确定可靠性的地方，并验证了视觉性，视觉范围的范围，并在视觉上差异。视觉和文本源之间的未对准破坏了实体表示。为了应对这些挑战，我们提出\ textbf {c} r \ textbf {o} ss-source知识\ textbf {re} conciliation \ textbf {m} ulti \ textbf {m} Core-MMRAG遵循四阶段的管道：它首先从参数知识中产生内部响应，然后通过联合相似性评估选择最相关的多模式证据，产生外部响应，并最终整合两者以产生可靠的答案。此外，专门的培训范式可以增强知识源歧视，多模式整合和统一的答案。 KB-VQA基准测试的实验表明，核心MMRAG比基线方法实现了实质性改进，分别在Infoseek和Infoseek和Infoseek和百科全书-VQA上实现了5.6 \％和9.3 \％的性能提高。我们以\ href {this HTTPS url} {此https url}发布代码和数据。

Title: Pruning General Large Language Models into Customized Expert Models

Authors: Yirao Zhao, Guizhen Chen, Kenji Kawaguchi, Lidong Bing, Wenxuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02561
Pdf URL: https://arxiv.org/pdf/2506.02561
Copy Paste: [[2506.02561]] Pruning General Large Language Models into Customized Expert Models(https://arxiv.org/abs/2506.02561)
Keywords: language model, llm
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their substantial model sizes often require substantial computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need compact expert models tailored to specific downstream scenarios. However, most existing pruning methods focus on preserving the model's general capabilities, often requiring extensive post-training or suffering from degraded performance due to coarse-grained pruning. In this work, we design a Custom Pruning method (Cus-Prun) to prune a large general model into a smaller lightweight expert model, which is positioned along the "language", "domain" and "task" dimensions. By identifying and pruning irrelevant neurons of each dimension, Cus-Prun creates expert models without any post-training. Our experiments demonstrate that Cus-Prun consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.
摘要：大型语言模型（LLMS）彻底改变了自然语言处理，但是它们的实质性模型通常需要大量的计算资源。为了保留计算资源并加速推理速度，对于修剪冗余参数至关重要，特别是对于经验丰富的用户，他们通常需要针对特定下游方案的紧凑型专家模型。但是，大多数现有的修剪方法都集中在保留该模型的一般能力，通常需要大量的培训后或由于粗粒修剪而导致的性能退化。在这项工作中，我们设计了一个$ \ underline {cus} $ tom $ \ undesline {prun} $ ing方法（$ \ texttt {cus-prun} $），将大型通用模型修剪成一个较小的轻质专家模型，该模型沿着“语言”，“域”和“任务”尺寸定位。通过识别和修剪每个维度无关的神经元，$ \ texttt {cus-prun} $创建了专家模型，而无需任何后培训。我们的实验表明，$ \ texttt {cus-prun} $始终胜过其他方法，从不同模型族和尺寸的各种模型中的专家和一般能力中实现了最小的损失。

Title: IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages

Authors: Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02573
Pdf URL: https://arxiv.org/pdf/2506.02573
Copy Paste: [[2506.02573]] IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages(https://arxiv.org/abs/2506.02573)
Keywords: language model, llm
Abstract: Although region-specific large language models (LLMs) are increasingly developed, their safety remains underexplored, particularly in culturally diverse settings like Indonesia, where sensitivity to local norms is essential and highly valued by the community. In this work, we present IndoSafety, the first high-quality, human-verified safety evaluation dataset tailored for the Indonesian context, covering five language varieties: formal and colloquial Indonesian, along with three major local languages: Javanese, Sundanese, and Minangkabau. IndoSafety is constructed by extending prior safety frameworks to develop a taxonomy that captures Indonesia's sociocultural context. We find that existing Indonesian-centric LLMs often generate unsafe outputs, particularly in colloquial and local language settings, while fine-tuning on IndoSafety significantly improves safety while preserving task performance. Our work highlights the critical need for culturally grounded safety evaluation and provides a concrete step toward responsible LLM deployment in multilingual settings. Warning: This paper contains example data that may be offensive, harmful, or biased.
摘要：尽管特定于地区的大型语言模型（LLM）越来越开发，但它们的安全性仍未得到充实，尤其是在像印度尼西亚这样的文化多样性的环境中，对当地规范的敏感性至关重要，并且受社区的高度评价。在这项工作中，我们介绍了印度安全的印度安全，这是针对印尼语境量身定制的第一个高质量的，人为验证的安全评估数据集，涵盖了五种语言品种：正式和口语印尼语，以及三种主要的当地语言：爪哇，圣达尼亚人，圣达尼亚人和米南卡巴（Minangkabau）。印度安全是通过扩展先前的安全框架来开发捕获印度尼西亚社会文化背景的分类法来构建的。我们发现，现有的以印尼为中心的LLM通常会产生不安全的输出，尤其是在口语和本地语言设置中，同时对印尼的微调可显着提高安全性，同时保持任务绩效。我们的工作突出了对文化扎根的安全评估的迫切需求，并为在多语言环境中迈出了负责任的LLM部署的具体步骤。警告：本文包含可能令人反感，有害或有偏见的示例数据。

Title: Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM

Authors: Maria Levchenko
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.02589
Pdf URL: https://arxiv.org/pdf/2506.02589
Copy Paste: [[2506.02589]] Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM(https://arxiv.org/abs/2506.02589)
Keywords: language model, gpt, llm, prompt
Abstract: This paper addresses the challenge of Named Entity Recognition (NER) for person names within the specialized domain of Russian news texts concerning cultural events. The study utilizes the unique SPbLitGuide dataset, a collection of event announcements from Saint Petersburg spanning 1999 to 2019. A comparative evaluation of diverse NER models is presented, encompassing established transformer-based architectures such as DeepPavlov, RoBERTa, and SpaCy, alongside recent Large Language Models (LLMs) including GPT-3.5, GPT-4, and GPT-4o. Key findings highlight the superior performance of GPT-4o when provided with specific prompting for JSON output, achieving an F1 score of 0.93. Furthermore, GPT-4 demonstrated the highest precision at 0.99. The research contributes to a deeper understanding of current NER model capabilities and limitations when applied to morphologically rich languages like Russian within the cultural heritage domain, offering insights for researchers and practitioners. Follow-up evaluation with GPT-4.1 (April 2025) achieves F1=0.94 for both simple and structured prompts, demonstrating rapid progress across model families and simplified deployment requirements.
摘要：本文探讨了在有关文化活动的俄语新闻文本这一专门领域中识别人名的命名实体识别（NER）问题。该研究使用了独特的SPbLitGuide数据集，这是一个涵盖1999年至2019年圣彼得堡活动公告的集合。文中对多种NER模型进行了比较评估，既包括DeepPavlov、RoBERTa和SpaCy等成熟的基于Transformer的架构，也包括GPT-3.5、GPT-4和GPT-4o等近期的大型语言模型（LLM）。主要发现表明，GPT-4o在给出要求JSON输出的特定提示时表现最佳，F1得分达到0.93。此外，GPT-4的精确率最高，为0.99。该研究有助于更深入地了解当前NER模型在文化遗产领域应用于俄语这类形态丰富的语言时的能力和局限，为研究人员和从业者提供参考。使用GPT-4.1（2025年4月）进行的后续评估在简单提示和结构化提示下均达到F1 = 0.94，表明各模型系列进步迅速，部署要求也得以简化。
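
The precision and F1 figures quoted in this abstract are set-overlap metrics over extracted person names. A minimal sketch of how such scores could be computed from a JSON-formatted model reply follows; the reply format, the `persons` key, and the example names are assumptions for illustration, not the paper's actual prompt or data.

```python
import json


def person_f1(gold, predicted):
    """Return (precision, recall, f1) over sets of person-name strings."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical structured reply from a model asked for JSON output.
reply = '{"persons": ["Anna Akhmatova", "Joseph Brodsky", "Nevsky Prospekt"]}'
# Hypothetical gold annotation for the same announcement.
gold = ["Anna Akhmatova", "Joseph Brodsky", "Sergei Dovlatov"]

predicted = json.loads(reply)["persons"]
precision, recall, f1 = person_f1(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Here the street name is a false positive and one author is missed, so precision and recall both fall to two thirds. Exact string matching is the simplest choice; a real evaluation of Russian text would likely need to normalize inflected name forms first.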

Title: On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures

Authors: Minh Duc Bui, Kyung Eun Park, Goran Glavaš, Fabian David Schmidt, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02591
Pdf URL: https://arxiv.org/pdf/2506.02591
Copy Paste: [[2506.02591]] On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures(https://arxiv.org/abs/2506.02591)
Keywords: language model, llm, chain-of-thought
Abstract: Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs' answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.
摘要：测量系统（例如，货币）在各种文化之间有所不同，但是它们之间的转换良好，以便人类可以使用其选择的任何测量系统来陈述事实。大型语言模型（LLM）也可供不同文化背景的用户使用，无论手头测量系统如何，都应该能够提供准确的信息。使用新编译的数据集，我们测试了七个开源LLM的情况，解决了三个关键的研究问题：（RQ1）LLMS用于每种测量类型的默认系统是什么？（RQ2）LLMS的答案及其准确性在不同的测量系统中有所不同吗？（RQ3）LLM可以缓解潜在的挑战W.R.T.通过推理代表性不足的系统？我们的发现表明，LLMS默认为数据中主要使用的测量系统。此外，我们观察到了不同测量系统的绩效的巨大不稳定和差异。尽管这种不稳定性可以部分通过采用诸如思想链（COT）之类的推理方法来缓解，但这意味着较长的响应，从而大大增加了测试时间计算（和推理成本），从而使使用不足以说明的测量系统的文化背景的用户边缘化。

Title: Beyond the Surface: Measuring Self-Preference in LLM Judgments

Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02592
Pdf URL: https://arxiv.org/pdf/2506.02592
Copy Paste: [[2506.02592]] Beyond the Surface: Measuring Self-Preference in LLM Judgments(https://arxiv.org/abs/2506.02592)
Keywords: language model, llm
Abstract: Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at this https URL.
摘要：最近的研究表明，大型语言模型（LLMS）在担任法官时表现出自我的偏见，这意味着他们倾向于比其他模型产生的反应相比，倾向于自己的反应。现有方法通常通过计算法官模型分配给其自身响应的分数之间的差异以及分配给其他模型的响应的分数之间的差异来衡量这种偏见。但是，这种方法将自我偏爱偏见与响应质量混为一谈，因为法官模型的较高质量响应也可能导致积极得分差异，即使在没有偏见的情况下也是如此。为了解决这个问题，我们将黄金判断作为代理响应质量的代理，并提出了DBG分数，该分数将自我挑战偏见作为法官模型分配的分数与其自身响应和相应的黄金判断之间的差异。由于黄金判断反映了真正的响应质量，因此DBG得分减轻了响应质量对偏差测量的混杂作用。使用DBG分数，我们进行了全面的实验，以评估各种版本，大小和推理能力的LLM之间的自我偏差偏差。此外，我们研究了影响和有助于减轻自我挑战偏见的两个因素：响应文本样式和法官模型的培训后数据。最后，我们从基于注意力的角度探讨了自我偏爱偏见的潜在潜在机制。我们的代码和数据可在此HTTPS URL上找到。
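
To make the contrast in this abstract concrete, the sketch below compares the conventional score gap with a DBG-style score on toy numbers. Averaging over responses, the 1-10 scale, and all variable names are assumptions; the abstract only states that the DBG score is the difference between the judge's scores for its own responses and the corresponding gold judgments.

```python
from statistics import mean


def naive_gap(judge_on_own, judge_on_other):
    """Conventional measure: judge's mean score for itself minus for others."""
    return mean(judge_on_own) - mean(judge_on_other)


def dbg_score(judge_on_own, gold_on_own):
    """DBG-style measure: mean per-response gap between judge and gold scores."""
    return mean(j - g for j, g in zip(judge_on_own, gold_on_own))


# Toy scores on an assumed 1-10 scale.
judge_on_own = [9, 8, 9, 7]    # judge model scoring its own responses
gold_on_own = [9, 8, 8, 7]     # gold judgments for those same responses
judge_on_other = [6, 5, 7, 6]  # judge model scoring another model's responses

gap = naive_gap(judge_on_own, judge_on_other)
dbg = dbg_score(judge_on_own, gold_on_own)
print(gap, dbg)
```

On these numbers the naive gap is 2.25, which looks like strong self-preference, while the DBG-style value is only 0.25: most of the gap is explained by the judge's responses genuinely being better according to the gold judgments.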

Title: EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

Authors: Fan Gao, Dongyuan Li, Ding Xia, Fei Mi, Yasheng Wang, Lifeng Shang, Baojun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02596
Pdf URL: https://arxiv.org/pdf/2506.02596
Copy Paste: [[2506.02596]] EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing(https://arxiv.org/abs/2506.02596)
Keywords: language model, llm, prompt
Abstract: Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the Open-Ended and Constrained sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With EssayBench, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
摘要：中文作文写作及其评估在教育环境中至关重要，但大型语言模型（LLM）在该领域的能力在很大程度上仍未得到充分探索。现有基准通常依赖粗粒度的文本质量指标，在很大程度上忽略了中文作文在结构和修辞上的复杂性，尤其是在不同体裁之间。为了弥补这一差距，我们提出了EssayBench，这是一个专为中文作文写作设计的多体裁基准，涵盖四种主要体裁：议论文、记叙文、描写文和说明文。我们共整理并完善了728个真实世界的提示以确保真实性，并将其细致地划分为Open-Ended和Constrained两个集合，以覆盖多样的写作场景。为了可靠地评估生成的作文，我们开发了一个细粒度、针对特定体裁的评分框架，对分数进行分层汇总。我们还通过一项全面的人工一致性研究进一步验证了评估方案。最后，我们对15个大型LLM进行了基准测试，分析了它们在不同体裁和指令类型上的优势和局限。借助EssayBench，我们旨在推进基于LLM的中文作文评估，并启发未来关于在教育环境中改进作文生成的研究。

Title: Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs

Authors: Manon Reusens, Bart Baesens, David Jurgens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02659
Pdf URL: https://arxiv.org/pdf/2506.02659
Copy Paste: [[2506.02659]] Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs(https://arxiv.org/abs/2506.02659)
Keywords: language model, llm
Abstract: Personalized Large Language Models (LLMs) are increasingly used in diverse applications, where they are assigned a specific persona - such as a happy high school teacher - to guide their responses. While prior research has examined how well LLMs adhere to predefined personas in writing style, a comprehensive analysis of consistency across different personas and task types is lacking. In this paper, we introduce a new standardized framework to analyze consistency in persona-assigned LLMs. We define consistency as the extent to which a model maintains coherent responses when assigned the same persona across different tasks and runs. Our framework evaluates personas across four different categories (happiness, occupation, personality, and political stance) spanning multiple task dimensions (survey writing, essay generation, social media post generation, single turn, and multi-turn conversations). Our findings reveal that consistency is influenced by multiple factors, including the assigned persona, stereotypes, and model design choices. Consistency also varies across tasks, increasing with more structured tasks and additional context. All code is available on GitHub.
摘要：个性化的大语言模型（LLM）越来越多地用于不同的应用程序中，在这些应用程序中，他们被分配了一个特定的角色（例如快乐的高中老师）来指导他们的回答。虽然先前的研究已经检查了LLMS如何以书面形式遵守预定义的角色，但缺乏对不同角色和任务类型的一致性的全面分析。在本文中，我们引入了一个新的标准化框架，以分析角色分配的LLMS的一致性。我们将一致性定义为模型在各个任务和运行中分配相同角色时保持连贯响应的程度。我们的框架评估了跨越多个任务维度的四个不同类别（幸福，职业，个性和政治立场）的角色（调查写作，论文产生，社交媒体后期，单转和多转交谈）。我们的发现表明，一致性受多种因素的影响，包括指定的角色，刻板印象和模型设计选择。一致性在任务中也有所不同，随着更具结构化的任务和其他上下文而增加。所有代码均可在GitHub上找到。

Title: EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Authors: Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Wei Chengzhi, Lin Yan, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02672
Pdf URL: https://arxiv.org/pdf/2506.02672
Copy Paste: [[2506.02672]] EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving(https://arxiv.org/abs/2506.02672)
Keywords: language model, llm
Abstract: We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.
摘要：我们提出了EvaLearn，这是一个开创性的基准，旨在评估大型语言模型（LLM）在具有挑战性的任务中的学习能力和效率，这是模型潜力中关键但尚未得到充分探索的方面。EvaLearn包含涵盖六种任务类型的648个具有挑战性的问题，分为182个序列，每个序列专用于一种任务类型。与大多数并行评估模型的现有基准不同，EvaLearn要求模型按顺序解决问题，使其能够利用从先前解答中获得的经验。EvaLearn提供了五个全面的自动化指标，用于评估模型并量化其学习能力和效率。我们对九个前沿模型进行了广泛的基准测试，并观察到各不相同的性能表现：某些模型（例如Claude-3.7-sonnet）初始性能中等，但表现出很强的学习能力，而另一些模型则难以从经验中受益，甚至可能出现负迁移。此外，我们研究了两种学习设置下的模型性能，发现实例级评分标准和教师模型反馈可进一步促进模型学习。重要的是，我们观察到，当前静态能力更强的LLM并未在所有任务上表现出明显的学习能力优势，这表明EvaLearn评估的是模型性能的一个新维度。我们希望EvaLearn能为评估LLM潜力、理解模型与人类能力之间的差距提供新的评估视角，推动更深入、更动态的评估方法的发展。所有数据集、自动评估框架以及本文研究的结果均可在GitHub存储库中获得。

Title: TL;DR: Too Long, Do Re-weighting for Effcient LLM Reasoning Compression

Authors: Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Ying Nian Wu, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu
Subjects: cs.CL, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2506.02678
Pdf URL: https://arxiv.org/pdf/2506.02678
Copy Paste: [[2506.02678]] TL;DR: Too Long, Do Re-weighting for Effcient LLM Reasoning Compression(https://arxiv.org/abs/2506.02678)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially during inference with extremely long outputs--has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.
摘要：大型语言模型（LLMS）最近通过利用强化学习和扩展思想链（COT）技术取得了显着的进步。但是，执行有效的语言推理的挑战 - 尤其是在推断出非常长的输出期间 - 引起了研究社区的越来越多的关注。在这项工作中，我们提出了一个基于动态比率的训练管道，该管道不依赖于多个模型之间的复杂数据注释或插值。我们不断平衡模型的System-1和System-2数据之间的权重，以消除冗余推理过程，同时保留模型的推理能力。我们在DeepSeek-R1-Distill-7B和DeepSeek-R1-Distill-14B以及各种难度水平不同的基准组上验证了我们的方法。我们的方法在维持推理的准确性的同时，将输出令牌的数量大大减少了近40％。我们的代码和数据将很快可用。

Title: Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints

Authors: Zhengdong Lu, Weikai Lu, Yiling Tao, Yun Dai, ZiXuan Chen, Huiping Zhuang, Cen Chen, Hao Peng, Ziqian Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02683
Pdf URL: https://arxiv.org/pdf/2506.02683
Copy Paste: [[2506.02683]] Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints(https://arxiv.org/abs/2506.02683)
Keywords: language model, llm, agent
Abstract: Despite significant advances in Large Language Models (LLMs), planning tasks still present challenges for LLM-based agents. Existing planning methods face two key limitations: heavy constraints and cascading errors. To address these limitations, we propose a novel parallel planning paradigm, which Decomposes, Plans for subtasks in Parallel, and Merges subplans into a final plan (DPPM). Specifically, DPPM decomposes the complex task based on constraints into subtasks, generates the subplan for each subtask in parallel, and merges them into a global plan. In addition, our approach incorporates a verification and refinement module, enabling error correction and conflict resolution. Experimental results demonstrate that DPPM significantly outperforms existing methods in travel planning tasks.
摘要：尽管大型语言模型（LLM）取得了重大进展，但计划任务仍然给基于LLM的代理带来挑战。现有的计划方法面临两个关键局限性：严重的限制和级联错误。为了解决这些局限性，我们提出了一个新颖的并行计划范式，该范式分解，并行的子任务计划，并将子计划合并为最终计划（DPPM）。具体而言，DPPM根据约束将复杂的任务分解为子任务，并并行生成每个子任务的子计划，并将其合并为全局计划。此外，我们的方法包含了验证和改进模块，从而实现了错误纠正和解决冲突。实验结果表明，DPPM在旅行计划任务中明显胜过现有方法。

Title: MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

Authors: Liang Yue, Yihong Tang, Kehai Chen, Jie Liu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02689
Pdf URL: https://arxiv.org/pdf/2506.02689
Copy Paste: [[2506.02689]] MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching(https://arxiv.org/abs/2506.02689)
Keywords: language model, agent
Abstract: Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models' instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.
摘要：指令微调在NLP任务中至关重要，它能增强预训练模型的指令遵循能力和特定任务的性能。但是，由于数据收集困难和制作成本高，为大型模型获取高质量的微调数据颇具挑战。为了解决这个问题，我们提出了MASTER，这是一种新的数据增强方法，通过具有不同认知水平的多个智能体之间的交互来丰富原始数据。我们模拟了三种有教学法依据的教学场景，利用多智能体对话生成高质量的师生互动数据。借助MASTER，我们构建了BOOST-QA，这是一个由Orca-Math-200k、ProcQA和OpenHermes2.5等现有数据集增强而来的微调数据集。实验表明，用BOOST-QA微调的模型在多个基准测试中表现出色，展现出很强的多任务泛化能力。值得注意的是，MASTER显著提高了模型在复杂任务中的推理能力，为未来的研究提供了宝贵的见解。

Title: On Entity Identification in Language Models

Authors: Masaki Sakata, Sho Yokoi, Benjamin Heinzerling, Takumi Ito, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02701
Pdf URL: https://arxiv.org/pdf/2506.02701
Copy Paste: [[2506.02701]] On Entity Identification in Language Models(https://arxiv.org/abs/2506.02701)
Keywords: language model
Abstract: We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions -- ambiguity and variability -- and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.
摘要：我们分析语言模型（LMS）的内部表示识别和区分指定实体的提及的程度，重点是实体及其提及之间的多一对一的对应关系。我们首先提出了实体提及的两个问题 - 歧义和可变性 - 并提出了一个类似于聚类质量指标的框架。具体而言，我们通过对LM内部表示的群集分析来量化同一实体群集在一起的程度以及对不同实体的提及保持分离。我们的实验检查了五个基于变压器的自回归模型，表明它们有效地识别和区分了类似于精度的指标的实体，回忆范围为0.66至0.9。进一步的分析表明，与实体相关的信息在早期LM层的低维线性子空间中被紧密表示。此外，我们阐明了实体表示的特征如何影响单词预测性能。这些发现是通过LM表示与实体以实体知识结构之间的同构晶格来解释的，从而提供了有关LMS内部组织和使用实体信息的见解。

Title: RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

Authors: Qihang Yan, Xinyu Zhang, Luming Guo, Qi Zhang, Feifan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02726
Pdf URL: https://arxiv.org/pdf/2506.02726
Copy Paste: [[2506.02726]] RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models(https://arxiv.org/abs/2506.02726)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs' knowledge application, reasoning reliability, and process transparency in complex vertical domains.
摘要：大型语言模型（LLM）在垂直领域中存在准确性、领域特定推理和可解释性方面的不足。传统的偏好对齐方法，例如基于人类反馈的强化学习（RLHF）和直接偏好优化（DPO），往往忽略底层的知识来源和推理逻辑。本文提出了RACE-Align（Retrieval-Augmented and Chain-of-Thought Enhanced Alignment，检索增强与思维链增强的对齐），这是一个旨在解决这些局限的新框架。RACE-Align系统地构建了一个融合外部知识支持和显式思维链（CoT）推理的二元偏好数据集，然后使用DPO算法对LLM进行对齐。其核心创新在于偏好数据的构建策略：它整合了AI驱动的检索以提供事实依据，从而增强知识性和准确性，并强调对领域特定CoT的优化，将推理过程本身视为关键的偏好维度。一个多阶段、AI驱动的精炼流水线以较低成本生成这些偏好对。以Qwen3-1.7B为基础模型在中医（TCM）领域进行的实验验证表明，RACE-Align明显优于原始基础模型以及仅经过监督微调（SFT）的模型。在多个维度上均观察到了改进，包括答案准确性、信息丰富度、中医思维模式的运用、推理的逻辑性和深度以及可解释性。这些发现表明，RACE-Align为增强LLM在复杂垂直领域中的知识应用、推理可靠性和过程透明度提供了一条有效途径。

Title: Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs

Authors: Stefano Bannò, Kate Knill, Mark Gales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02758
Pdf URL: https://arxiv.org/pdf/2506.02758
Copy Paste: [[2506.02758]] Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs(https://arxiv.org/abs/2506.02758)
Keywords: language model, llm
Abstract: Vocabulary use is a fundamental aspect of second language (L2) proficiency. To date, its assessment by automated systems has typically examined the context-independent, or part-of-speech (PoS) related use of words. This paper introduces a novel approach to enable fine-grained vocabulary evaluation exploiting the precise use of words within a sentence. The scheme combines large language models (LLMs) with the English Vocabulary Profile (EVP). The EVP is a standard lexical resource that enables in-context vocabulary use to be linked with proficiency level. We evaluate the ability of LLMs to assign proficiency levels to individual words as they appear in L2 learner writing, addressing key challenges such as polysemy, contextual variation, and multi-word expressions. We compare LLMs to a PoS-based baseline. LLMs appear to exploit additional semantic information that yields improved performance. We also explore correlations between word-level proficiency and essay-level proficiency. Finally, the approach is applied to examine the consistency of the EVP proficiency levels. Results show that LLMs are well-suited for the task of vocabulary assessment.
摘要：词汇使用是第二语言（L2）熟练度的基本方面。迄今为止，自动化系统的评估通常已经检查了与上下文无关或词性（POS）相关的单词使用。本文介绍了一种新颖的方法，以实现细粒度的词汇评估，从而利用句子在句子中的精确使用。该方案将大型语言模型（LLM）与英文词汇概况（EVP）结合在一起。 EVP是一种标准的词汇资源，可以使内在词汇用途与熟练程度相关联。我们评估了LLM在L2学习者写作中出现的单个单词分配能力水平的能力，以应对诸如多义，上下文变化和多字表达等关键挑战。我们将LLMS与基于POS的基线进行比较。 LLM似乎利用了其他语义信息，从而提高了性能。我们还探讨了单词级水平和论文级水平之间的相关性。最后，采用该方法来检查EVP能力水平的一致性。结果表明，LLM非常适合词汇评估的任务。

Title: SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

Authors: Sifan Li, Yujun Cai, Yiwei Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02803
Pdf URL: https://arxiv.org/pdf/2506.02803
Copy Paste: [[2506.02803]] SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking(https://arxiv.org/abs/2506.02803)
Keywords: language model, prompt
Abstract: Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking) by simply scaling images to low resolutions (32-128 pixels), which unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
摘要：视觉语言模型（VLMS）在语义任务中表现出色，但在人类核心能力上摇摇欲坠：通过缩放等感知调整来检测错觉或AI生成的图像中的隐藏内容。我们介绍了HC Bench，这是112张带有隐藏文本，对象和幻觉的图像的基准，表明领先的VLMS达到了接近零的精度（0-5.36％），即使有明确的提示。人类本能地解决了这种歧义，但是由于对高级语义的过度依赖，VLMS失败了。令人惊讶的是，我们通过简单地将图像缩放到低分辨率（32-128像素）来提出Semvink（语义视觉思维），该图像通过消除冗余的视觉噪声来解锁> 99％的精度。这揭示了关键的建筑缺陷：VLMS优先考虑抽象推理，而不是对现实世界鲁棒性至关重要的低级视觉操作。我们的工作敦促向整合多尺度处理的混合模型转变，从而在医学成像，安全性及其他方面的应用中弥合了计算视觉和人类认知之间的差距。
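
The core operation described in this abstract is scaling an image down so that fine detail is averaged away and only coarse structure remains. A minimal sketch on a toy grayscale grid follows; block averaging and the `downscale` helper are assumptions, since the abstract does not say which resampling method the authors use.

```python
def downscale(image, factor):
    """Shrink a 2D grayscale grid by averaging factor x factor pixel blocks.

    Rows or columns that do not fill a complete block are dropped.
    """
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


# Toy 4x4 image: a dark left half and bright right half (coarse content)
# overlaid with a pixel-level checkerboard (fine texture).
image = [
    [0, 2, 10, 12],
    [2, 0, 12, 10],
    [0, 2, 10, 12],
    [2, 0, 12, 10],
]
small = downscale(image, 2)
print(small)
```

After downscaling, the checkerboard texture is gone and each row reads as one dark value followed by one bright value, which is the kind of global pattern the abstract says becomes visible at low resolution.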

Title: ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations

Authors: Ekaterina Grishina, Mikhail Gorbunov, Maxim Rakhuba
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02818
Pdf URL: https://arxiv.org/pdf/2506.02818
Copy Paste: [[2506.02818]] ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations(https://arxiv.org/abs/2506.02818)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at this https URL
摘要：大型语言模型（LLMS）在自然语言处理任务中表现出令人印象深刻的结果，但需要大量的计算和内存资源。结构化矩阵表示是减少这些模型参数数量的有希望的方法。但是，期望预期模型的重量矩阵可以通过结构化矩阵而没有任何微调来准确地表示，这似乎是不现实的。为了克服这个问题，我们利用了一个事实，即在重量矩阵的某些正交转换下，LLM输出是不变的。可以利用这种见识来确定变换，从而显着改善结构化类中权重的可压缩性。所提出的方法适用于支持有效投影操作的各种类型的结构化矩阵。代码可在此HTTPS URL上找到
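
The invariance this abstract relies on can be illustrated with two stacked linear maps: inserting an orthogonal matrix Q after the first and its transpose before the second leaves the output unchanged, because the transpose of Q times Q is the identity. The toy below uses a 2x2 rotation in plain Python; it assumes no nonlinearity between the two maps and does not reproduce the paper's actual transformations or structured matrix classes.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


theta = 0.7  # arbitrary rotation angle
Q = [
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
]

W1 = [[1.0, 2.0], [3.0, 4.0]]   # first linear map (toy weights)
W2 = [[0.5, -1.0], [2.0, 1.5]]  # second linear map (toy weights)
x = [[1.0], [-2.0]]             # input column vector

original = matmul(W2, matmul(W1, x))

# Rotate the intermediate representation: W1 -> Q W1, W2 -> W2 Q^T.
W1_rot = matmul(Q, W1)
W2_rot = matmul(W2, transpose(Q))
rotated = matmul(W2_rot, matmul(W1_rot, x))

max_diff = max(abs(o[0] - r[0]) for o, r in zip(original, rotated))
print(max_diff < 1e-9)
```

The rotated weights differ entry by entry from the originals yet compute the same function, so one is free to pick whichever rotation makes the weights easiest to approximate with a structured matrix.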

Title: TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference

Authors: Yulin Dou, Jiangming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02827
Pdf URL: https://arxiv.org/pdf/2506.02827
Copy Paste: [[2506.02827]] TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference(https://arxiv.org/abs/2506.02827)
Keywords: language model, llm
Abstract: Large language models (LLMs) can effectively elicit human preferences through multi-turn dialogue. Complex tasks can be accomplished through iterative clarifying questions and final responses generated by an LLM acting as a questioner (STaR-GATE; Andukuri et al., 2024). However, existing approaches based on self-taught reasoning struggle to identify optimal dialogue trajectories and to avoid questions irrelevant to the task. To address this limitation, we propose TO-GATE, a novel framework that enhances question generation through trajectory optimization, which consists of two key components: a clarification resolver that generates optimal questioning trajectories, and a summarizer that ensures task-aligned final responses. The trajectory optimization enables the model to produce effective elicitation questions and summary responses tailored to specific tasks. Experimental results demonstrate that TO-GATE significantly outperforms baseline methods, achieving a 9.32% improvement on standard preference elicitation tasks.
摘要：大型语言模型（LLM）可以通过多轮对话有效地引出人类偏好。复杂的任务可以通过由充当提问者的LLM生成的迭代澄清问题和最终回答来完成（STaR-GATE；Andukuri et al., 2024）。但是，现有的基于自学推理的方法难以确定最优的对话轨迹，也难以避免提出与任务无关的问题。为了解决这一局限，我们提出了TO-GATE，这是一个通过轨迹优化来增强问题生成的新框架，它由两个关键组件组成：生成最优提问轨迹的澄清解析器（clarification resolver），以及确保最终回答与任务一致的总结器（summarizer）。轨迹优化使模型能够生成有效的引导性问题和针对特定任务的总结性回答。实验结果表明，TO-GATE显著优于基线方法，在标准偏好引出任务上取得了9.32%的提升。

Title: Token and Span Classification for Entity Recognition in French Historical Encyclopedias

Authors: Ludovic Moncla, Hédi Zeghidi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.02872
Pdf URL: https://arxiv.org/pdf/2506.02872
Copy Paste: [[2506.02872]] Token and Span Classification for Entity Recognition in French Historical Encyclopedias(https://arxiv.org/abs/2506.02872)
Keywords: language model, prompt
Abstract: Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art performance, especially on nested entities, generative models offer promising alternatives when labeled data are scarce. The study highlights ongoing challenges in historical NER and suggests avenues for hybrid approaches combining symbolic and neural methods to better capture the intricacies of early modern French text.
摘要：历史文本中指定的实体识别（NER）提出了由于非标准化语言，古老的拼字法以及嵌套或重叠的实体而引起的独特挑战。这项研究基准了多种NER方法，从经典的条件随机字段（CRF）和基于SPACY的模型到基于变压器的架构，例如Camembert和Flair等序列标记的模型。实验是在Geoedda数据集上进行的，Geoedda数据集是一种源自18世纪法国百科全书的丰富注释语料库。我们建议将框架作为令牌级别和跨度级别的分类，以适应典型的历史文档的复杂嵌套实体结构。此外，我们通过用于低资源场景的生成语言模型来评估几乎没有发动机的新兴潜力。我们的结果表明，尽管基于变压器的模型达到了最先进的性能，尤其是在嵌套实体上，但生成模型在稀缺标记的数据时提供了有希望的替代方案。该研究强调了历史NER的持续挑战，并提出了结合符号和神经方法的混合方法的途径，以更好地捕捉早期现代法国文本的复杂性。

Title: CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective

Authors: Jintian Shao, Yiming Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02878
Pdf URL: https://arxiv.org/pdf/2506.02878
Copy Paste: [[2506.02878]] CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective(https://arxiv.org/abs/2506.02878)
Keywords: language model, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes. Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes.
摘要：经过思考链（COT）提示明显地增强了大语模型在需要多步推理的任务上的性能。这种成功导致了这些模型中新兴推理能力的广泛主张。在本文中，我们提出了一个理论上的反镜头：思想链（COT）不会引起真正的抽象推理。取而代之的是，我们认为，经过思考链是一种强大的结构约束，可以指导大型语言模型模仿推理的形式。通过迫使中间步骤的产生，经过思考的链利用了序列预测和模式匹配的巨大能力，从而有效地将其输出限制为类似于相似的思维过程的序列。

Title: IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator

Authors: Yusuke Sakai, Takumi Goto, Taro Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02899
Pdf URL: https://arxiv.org/pdf/2506.02899
Copy Paste: [[2506.02899]] IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator(https://arxiv.org/abs/2506.02899)
Keywords: language model
Abstract: We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.
摘要：我们提出了具有语法误差（GED）功能的新型无参考的自动语法校正校正（GEC）评估方法。我们专注于现有的自动GEC评估方法Impara的质量估计器，并使用具有增强的GED功能的预训练的语言模型来构建Imparage的质量估计器。 SEEDA的实验结果是一种用于自动GEC评估方法的元评估数据集，表明Imparaged与人类句子级别的评估达到了最高的相关性。

Title: Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning

Authors: Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu
Subjects: cs.CL, cs.AI, cs.CE, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02911
Pdf URL: https://arxiv.org/pdf/2506.02911
Copy Paste: [[2506.02911]] Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning(https://arxiv.org/abs/2506.02911)
Keywords: language model, llm
Abstract: Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at this https URL.
摘要：细胞类型注释是分析单细胞RNA测序数据的异质性的关键任务。尽管最近的基础模型使该过程自动化，但它们通常会独立注释细胞，而无需考虑批处理级的细胞环境或提供解释性推理。相反，人类专家经常根据其领域知识来注释不同细胞簇的不同细胞类型。为了模仿此工作流程，我们介绍了CellPuzzles任务，其目的是为一批单元格分配唯一的单元格类型。该基准跨越各种组织，疾病和供体条件，并需要在批处理级别的细胞环境中进行推理，以确保标记唯一性。我们发现，现成的大语言模型（LLM）在细胞插曲上挣扎，最佳基线（OpenAI'S O1）仅达到19.0％的批处理水平准确性。为了填补这一空白，我们提出了Cell-O1，这是一个通过对蒸馏推理痕迹进行监督的微调训练的7B LLM，然后通过批处理级别的奖励进行加固学习。 Cell-O1实现最先进的性能，超过73％的O1超过73％，并且在跨环境中概括。对培训动力和推理行为的进一步分析为批处理级注释绩效和新兴专家般的推理提供了见解。代码和数据可在此HTTPS URL上找到。

Title: A Controllable Examination for Long-Context Language Models

Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02921
Pdf URL: https://arxiv.org/pdf/2506.02921
Copy Paste: [[2506.02921]] A Controllable Examination for Long-Context Language Models(https://arxiv.org/abs/2506.02921)
Keywords: language model
Abstract: Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches have intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: $\textit{seamless context}$, $\textit{controllable setting}$, and $\textit{sound evaluation}$. This study introduces $\textbf{LongBioBench}$, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of $\textit{understanding}$, $\textit{reasoning}$, and $\textit{trustworthiness}$. Our experimental evaluation, which includes $\textbf{18}$ LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, leave them vulnerable when testing a model's long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embedding to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.
摘要：现有用于评估长篇小说语言模型（LCLM）的框架可以广泛地分为现实世界和综合任务。尽管它们的效用，但两种方法都伴随着某些内在局限性。现实世界的任务太复杂，无法解释或表征，并且容易受到数据污染的影响。相比之下，合成任务通常采用针线中的针刺（NIAH）格式，其中“针头”和“ Haystack”之间缺乏连贯性会损害其作为现实应用程序的代理的有效性。为了应对这些挑战，我们认为应该以三个基本功能为特征：$ \ textit {nless context} $，$ \ textit {可控设置} $，以及$ \ textit {sound {sound评估} $。这项研究介绍了$ \ textbf {longbiobench} $，这是一种新颖的基准，该基准利用人为生成的传记作为控制环境，用于评估跨$ \ textit {sexepent {sexepent} $，$ \ textit {clieveit {concooming} $的LCLM，以及$ \ textit {privtit {textit {textit {trustworthiness} $。我们的实验评估总共包括$ \ textbf {18} $ LCLMS，表明大多数模型仍然在语义理解和基本推理中表现出缺陷，而基于检索结果的基础推理，并且随着上下文长度的增加，它不太值得信赖。我们的进一步分析表明，现有合成基准测试所采用的一些设计选择，例如上下文非固定，数值针和缺乏干扰器，使它们容易受到测试模型长期文化功能。此外，我们还揭示了长篇文化持续预处理主要会调节绳索嵌入以适应扩展的上下文长度。总而言之，与以前的合成基准相比，Longbiobench在镜像真实的语言任务和保持可控性之间取得了更好的权衡，并且是高度可解释和可配置的。

Title: INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification

Authors: Diogo A.P. Nunes, Eugénio Ribeiro
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02924
Pdf URL: https://arxiv.org/pdf/2506.02924
Copy Paste: [[2506.02924]] INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification(https://arxiv.org/abs/2506.02924)
Keywords: language model, llm, prompt
Abstract: In this work, we describe our team's approach to eRisk's 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck's Depression Inventory - II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled as to whether a given sentence was relevant or not w.r.t. one of BDI's symptoms. Due to this labeling limitation, we framed our development as a binary classification task for each BDI symptom, and evaluated accordingly. To that end, we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from 16 other teams.
摘要：在这项工作中，我们描述了团队对Erisk 2025任务1：寻找抑郁症状的方法。鉴于一组句子和贝克的抑郁量库存-II（BDI）问卷，参与者的任务是在BDI中提交每次抑郁症状多达1,000个句子，并按相关性排序。根据标准信息检索（IR）指标评估参与者的提交，包括平均精度（AP）和R-Precision（R-PREC）。但是，所提供的培训数据由标有有关给定句子是否相关的句子组成。 BDI的症状之一。由于这种标签限制，我们将发展作为每个BDI症状的二元分类任务，并进行了相应的评估。为此，我们将可用标记的数据分为培训和验证集，并探索基础模型微调，句子相似性，大语言模型（LLM）提示和集成技术。验证结果表明，微调基础模型产生了最佳性能，尤其是在增强合成数据以减轻类不平衡时。我们还观察到，最佳方法因症状而异。基于这些见解，我们设计了五个独立的测试运行，其中两种使用了集合方法。这些跑步在官方IR评估中取得了最高的成绩，表现优于其他16支球队的提交。
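
The two ranking metrics named in the abstract above, Average Precision (AP) and R-Precision (R-PREC), can be sketched in a few lines. This is a minimal illustration using the textbook definitions, not the official eRisk scorer; the function names and the 0/1 relevance-list input format are assumptions made for this sketch.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """AP for one ranked list of 0/1 relevance labels.

    total_relevant is the number of relevant items in the whole collection;
    if omitted, the number of relevant items found in the list is used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def r_precision(ranked_relevance, total_relevant):
    """Precision within the top R results, where R = total_relevant."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:total_relevant]) / total_relevant


if __name__ == "__main__":
    ranking = [1, 0, 1, 0, 0]  # hypothetical submission for one symptom
    print(average_precision(ranking, total_relevant=2))
    print(r_precision(ranking, total_relevant=2))
```

Passing `total_relevant` explicitly matters when a submission is truncated (here, at most 1,000 sentences per symptom), because relevant sentences missing from the list should still lower AP.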

Title: Quantitative LLM Judges

Authors: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02945
Pdf URL: https://arxiv.org/pdf/2506.02945
Copy Paste: [[2506.02945]] Quantitative LLM Judges(https://arxiv.org/abs/2506.02945)
Keywords: language model, llm
Abstract: LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
摘要：LLM-AS-A-Gudge是一个框架，其中大语言模型（LLM）自动评估另一个LLM的输出。我们提出了定量LLM法官，使用回归模型将现有LLM法官的评估评分与给定领域的人类分数保持一致。通过使用法官的文字评估和分数，对模型进行了培训，以提高原始法官的得分。我们为不同类型的绝对反馈和相对反馈提供了四个定量法官，这些法官展示了我们框架的一般性和多功能性。我们的框架比监督的微调更有效地计算效率，并且当人类反馈受到限制时，在我们工作的大多数应用中都可以预期，这在统计上更有效。我们使用两个基本法官在四个数据集上经验验证了这些主张。我们的实验表明，定量法官可以通过事后建模有效地提高现有法官的预测能力。
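
The core idea in the abstract above, aligning an existing judge's scores to human scores with a regression model, can be sketched with ordinary least squares. This is a deliberately reduced illustration: it regresses on the judge's numeric score only, whereas the abstract says the models also use the judge's textual evaluation. The function names and the toy data are assumptions, not the authors' implementation.

```python
def fit_linear(judge_scores, human_scores):
    """Least-squares fit of human = slope * judge + intercept."""
    n = len(judge_scores)
    if n == 0 or n != len(human_scores):
        raise ValueError("need two non-empty lists of equal length")
    mean_x = sum(judge_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in judge_scores)
    if var_x == 0:
        raise ValueError("judge scores are constant; slope is undefined")
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(judge_scores, human_scores))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(judge_score, slope, intercept):
    """Map a raw judge score onto the human score scale."""
    return slope * judge_score + intercept


if __name__ == "__main__":
    # Hypothetical calibration set: raw judge scores and human scores.
    slope, intercept = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])
    print(calibrate(5, slope, intercept))
```

Because only two parameters are fitted post hoc, a handful of human-labeled examples can be enough, which is consistent with the abstract's point about limited human feedback.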

Title: Adaptive Graph Pruning for Multi-Agent Communication

Authors: Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2506.02951
Pdf URL: https://arxiv.org/pdf/2506.02951
Copy Paste: [[2506.02951]] Adaptive Graph Pruning for Multi-Agent Communication(https://arxiv.org/abs/2506.02951)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: first, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and generalizing consistently across multiple mainstream LLM architectures, with an increase in performance of $2.58\%\sim 9.84\%$; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, requiring fewer training steps and lower token consumption at the same time, with a decrease in token consumption of $90\%+$; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods. Performance surpasses the existing baselines after about ten training steps on the six benchmarks.
摘要：基于大型语言模型（LLM）的多代理系统在各种任务中都表现出色，尤其是当通过协作沟通增强时。但是，当前的方法通常依靠固定数量的代理和静态通信结构，从而限制了它们适应不同任务复杂性的能力。在本文中，我们提出了自适应图修剪（AGP），这是一种新型的任务自适应多代理协作框架，共同优化了代理数量（硬质）和通信拓扑（软拓扑）。具体而言，我们的方法采用了两阶段的培训策略：首先，针对不同代理数量的独立培训软培训网络，以确定特定任务跨特定任务的最佳特定于特定于特定的特定于特定于特定的完整图和位置掩码；然后在最大完整的图中共同优化硬构和软盘，以动态配置代理的数量及其每个任务的通信拓扑。广泛的实验表明，我们的方法是：（1）在六个基准测试中，高性能，实现最先进的结果，并始终概括多个主流LLM架构，其性能提高了2.58美元\％\％\ sim 9.84 \％\％$; （2）任务自适应，动态构建针对特定任务量身定制的优化通信拓扑，在所有三个任务类别（一般推理，数学推理和代码生成）中的性能极高；（3）令牌经济学，同时拥有更少的培训步骤和令牌消费，令牌消费量减少了$ 90 \％+$；（4）训练有效，与其他方法相比，几乎没有训练步骤来实现高性能。在六个基准下进行训练大约十个步骤后，该性能将超过现有的基准。

Title: HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring

Authors: Zhixiong Su, Yichen Wang, Herun Wan, Zhaohan Zhang, Minnan Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02959
Pdf URL: https://arxiv.org/pdf/2506.02959
Copy Paste: [[2506.02959]] HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring(https://arxiv.org/abs/2506.02959)
Keywords: language model, llm
Abstract: The misuse of large language models (LLMs) poses potential risks, motivating the development of machine-generated text (MGT) detection. Existing literature primarily concentrates on binary, document-level detection, thereby neglecting texts that are composed jointly by human and LLM contributions. Hence, this paper explores the possibility of fine-grained MGT detection under human-AI coauthoring. We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio. Specifically, we propose a dataset, HACo-Det, which produces human-AI coauthored texts via an automatic pipeline with word-level attribution labels. We retrofit seven prevailing document-level detectors to generalize them to word-level detection. Then we evaluate these detectors on HACo-Det on both word- and sentence-level detection tasks. Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score, while finetuned models show superior performance and better generalization across domains. However, we argue that fine-grained co-authored text detection is far from solved. We further analyze factors influencing performance, e.g., context window, and highlight the limitations of current methods, pointing to potential avenues for improvement.
摘要：大型语言模型（LLMS）的滥用带来了潜在的风险，激发了机器生成的文本（MGT）检测的发展。现有文献主要集中于二元，文档级检测，从而忽略了由人类和LLM贡献共同组成的文本。因此，本文探讨了在人类合着下进行细粒度MGT检测的可能性。我们建议细粒探测器可以用数字AI比铺平途径朝着编写的文本检测。具体而言，我们提出了一个数据集Haco-Det，该数据集通过具有单词级属性标签的自动管道来生成人类ai的文本。我们对七个盛行的文档级检测器进行了改造，以将其推广到单词级检测。然后，我们在单词和句子级检测任务上对HACO-DET上的这些检测器进行评估。经验结果表明，基于公制的方法很难以0.462的平均F1得分进行细粒度检测，而燃料模型则表现出较高的性能和更好的跨域概括。但是，我们认为细粒度的共同作品的文本检测远非解决。我们进一步分析了影响性能的因素，例如上下文窗口，并突出了当前方法的局限性，指出了潜在的改进途径。
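
The relationship between word-level attribution labels, sentence-level labels, and a numeric AI ratio described in the abstract above can be sketched as follows. The label encoding (1 = machine-generated word, 0 = human-written word) and the 0.5 majority threshold for sentence-level labels are assumptions for illustration; the abstract does not specify how HACo-Det aggregates labels.

```python
def ai_ratio(word_labels):
    """Fraction of words labeled as machine-generated (1) in a span."""
    return sum(word_labels) / len(word_labels) if word_labels else 0.0


def sentence_labels(sentences, threshold=0.5):
    """Label a sentence 1 if its AI ratio reaches the (assumed) threshold."""
    return [int(ai_ratio(labels) >= threshold) for labels in sentences]


def document_ai_ratio(sentences):
    """AI ratio over all words in the document."""
    all_labels = [label for labels in sentences for label in labels]
    return ai_ratio(all_labels)


if __name__ == "__main__":
    # Hypothetical coauthored document: two sentences of word-level labels.
    doc = [[0, 0, 1], [1, 1, 1, 0]]
    print(sentence_labels(doc))
    print(document_ai_ratio(doc))
```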

Title: FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Authors: Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02961
Pdf URL: https://arxiv.org/pdf/2506.02961
Copy Paste: [[2506.02961]] FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models(https://arxiv.org/abs/2506.02961)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.
摘要：大型语言模型（LLM）已经在不同领域取得了最新的结果，但是它们的发展仍然依赖大量的公开数据，这引起了人们对数据稀缺性的担忧，并且缺乏获得特定领域的敏感信息。联邦学习（FL）提出了一个令人信服的框架，可以通过在未共享原始数据的情况下对预训练的LLM进行分散的微调来解决这些挑战。但是，在FL设置中，预先训练的LLM的兼容性和性能在很大程度上仍在探索中。我们介绍了FlowerTune LLM排行榜，这是一种首个基准测试套件，旨在评估四个不同领域的LLM的联合微调：NLP一般，金融，医疗，医疗和编码。每个领域都包含联合指令调整数据集和特定于域的评估指标。我们的结果是通过协作，开源和社区驱动的方法获得的，在联合环境下具有不同的汇总和微调策略，提供了首次对26个预训练的LLM进行的全面比较，从而为模型性能，资源约束和领域适应提供了可行的见解。这项工作奠定了为现实世界应用开发隐私，域专用LLM的基础。

Title: Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

Authors: Dingwei Chen, Ziqiang Liu, Feiteng Fang, Chak Tou Leong, Shiwen Ni, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang, Chengming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02973
Pdf URL: https://arxiv.org/pdf/2506.02973
Copy Paste: [[2506.02973]] Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation(https://arxiv.org/abs/2506.02973)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as "hallucinations", remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs' internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
摘要：大型语言模型（LLMS）在文本理解和产生中表现出了显着的功能。但是，他们产生实际不一致的产出的趋势（通常称为“幻觉”）仍然是一个关键的挑战。现有方法（例如基于检索的和推理时间校正方法）主要在输入或输出级别上解决此问题，通常忽略了内在信息完善过程和早产层的作用。同时，基于对齐和微调的方法是资源密集的。在本文中，我们提出了PLI（过早的层插值），这是一种新颖，无训练和插件的干预措施，旨在增强事实。 PLI通过插入通过与相邻层的数学插值形成的过早层来减轻幻觉。受稳定的扩散和采样步骤的启发，PLI扩展了LLMS中信息处理和传输的深度，从而提高了事实相干性。在四个公开可用数据集上的实验表明，在大多数情况下，PLI有效地降低了幻觉，而在大多数情况下表现优于现有基准。进一步的分析表明，层插值的成功与LLMS的内部机制密切相关。为了促进可重复性，我们将在接受后发布我们的代码和数据。

Title: Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

Authors: Richard Armitage
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.02987
Pdf URL: https://arxiv.org/pdf/2506.02987
Copy Paste: [[2506.02987]] Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis(https://arxiv.org/abs/2506.02987)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Member of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.
摘要：背景：大型语言模型（LLM）表现出了支持临床实践的巨大潜力。除了聊天GPT4及其前任外，很少有LLM，尤其是领先和更强大的推理模型类别的LLM，还受到了医学专业考试问题，包括在初级保健领域。本文旨在测试截至2025年5月的领先LLM的能力（O3，Claude Opus 4，Grok3和Gemini 2.5 Pro）在初级保健教育中，特别是在回答皇家全科医生学院（MRCGP）风格考试问题时。方法：O3，Claude Opus 4，Grok3和Gemini 2.5 Pro的任务是从2025年5月25日从皇家全科医生GP自助测试中随机选择的100个多项选择问题。问题包括文本信息，实验室结果，实验室结果和临床图像。提示每个模型在英国作为GP回答，并提供完整的问题信息。每个模型都尝试一次尝试每个问题。反应是根据GP自学提供的正确答案评分的。结果：O3，Claude Opus 4，Grok3和Gemini 2.5 Pro的总得分分别为99.0％，95.0％，95.0％和95.0％。同一问题的平均同行分数为73.0％。讨论：所有模型的表现都非常出色，所有模型都超过了回答相同问题的GP和GP注册商的平均表现。 O3表现出了最佳性能，而其他领先模型的性能彼此相当，并且不高于O3。这些发现加强了LLM，尤其是推理模型的情况，以支持初级保健的交付，尤其是那些专门针对初级保健临床数据的培训的情况。

Title: It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems

Authors: Iuliia Zaitova, Badr M. Abdullah, Wei Xue, Dietrich Klakow, Bernd Möbius, Tania Avgustinova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02995
Pdf URL: https://arxiv.org/pdf/2506.02995
Copy Paste: [[2506.02995]] It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems(https://arxiv.org/abs/2506.02995)
Keywords: language model
Abstract: Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.
摘要：成语被定义为一组具有比喻性含义的单词，这些含义无法从其各个组成部分中推论。尽管现代的机器翻译系统取得了显着的进步，但翻译成语仍然是一个主要挑战，尤其是对于语音到文本系统，有关该主题的研究显然很少。在本文中，与文本到文本机器翻译（MT）中的传统新闻翻译相比，我们系统地评估了成语翻译，并在两种语言对（德语到英语，俄语到英语）中的文本转换（MT）和语音到文本翻译（SLT）系统。我们将最新的端到端SLT系统（SeamlessM4T SLT-TOXT，Whisper Gigal V3）与MT Systems（SeamlessM4T Slt-to-Toxt，没有留下的语言），大语言模型（Deepseek，Llama）和级联替代方案进行了比较。我们的结果表明，SLT系统在惯用数据上经历了明显的性能下降，即使在较高的层中也恢复到文字翻译，而MT系统和大语言模型表现出更好的习惯处理。这些发现强调了对特定于习语的策略的需求，并改善了SLT体系结构中的内部表示。

Title: A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Authors: Đorđe Klisura, Astrid R Bernaga Torres, Anna Karen Gárate-Escamilla, Rajesh Roshan Biswal, Ke Yang, Hilal Pataci, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02998
Pdf URL: https://arxiv.org/pdf/2506.02998
Copy Paste: [[2506.02998]] A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems(https://arxiv.org/abs/2506.02998)
Keywords: gpt, agent
Abstract: Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini's zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.
摘要：隐私政策将数据收集和用法告知用户，但它们的复杂性限制了对不同人群的可访问性。现有的隐私政策问题答案（QA）系统在英语方言之间表现出性能差异，非标准品种的弱势扬声器。我们提出了一个以人为本的设计原理启发的新型多代理框架，以减轻方言偏见。我们的方法集成了一种方言代理，该方言代理将查询转化为标准的美国英语（SAE），同时保留了方言意图，以及一个使用域专业知识的预测。与先前的方法不同，我们的方法不需要再培训或方言特定的微调，从而在模型和域中广泛适用。我们对PrivacyQA和PolicyQA进行了评估，我们的框架将GPT-4O-Mini的零弹药准确度从privacyqa上的0.394提高到0.601，从0.352提高到0.352到0.464，而PolicyQA则在没有其他培训数据的情况下超越或匹配了很少的STORELINE。这些结果突出了结构化代理协作在缓解方言偏见方面的有效性，并强调了设计NLP系统的重要性，该系统可以说明语言多样性，以确保公平访问隐私信息。

Title: Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech

Authors: Florian Ludwig, Torsten Zesch, Frederike Zufall
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03009
Pdf URL: https://arxiv.org/pdf/2506.03009
Copy Paste: [[2506.03009]] Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech(https://arxiv.org/abs/2506.03009)
Keywords: language model, llm
Abstract: The assessment of legal problems requires the consideration of a specific legal system and its levels of abstraction, from constitutional law to statutory law to case law. The extent to which Large Language Models (LLMs) internalize such legal systems is unknown. In this paper, we propose and investigate different approaches to conditioning LLMs at multiple levels of abstraction in legal systems to detect potentially punishable hate speech. We focus on the task of classifying whether a specific social media post falls under the criminal offense of incitement to hatred as prescribed by the German Criminal Code. The results show that there is still a significant performance gap between models and legal experts in the legal assessment of hate speech, regardless of the level of abstraction with which the models were conditioned. Our analysis revealed that models conditioned on abstract legal knowledge lacked deep task understanding, often contradicting themselves and hallucinating answers, while models using concrete legal knowledge performed reasonably well in identifying relevant target groups, but struggled with classifying target conducts.
摘要：对法律问题的评估需要考虑特定的法律体系及其抽象水平，从宪法到法定法到判例法。大型语言模型（LLMS）在多大程度上将这种法律制度内部化。在本文中，我们提出并调查了在法律系统中不同级别的抽象水平的LLM的不同方法。本文研究了法律系统中多个抽象的LLM调节LLM的不同方法，以检测潜在的受惩罚性仇恨言论。我们专注于分类特定社交媒体职位是否属于德国刑法规定的仇恨的刑事犯罪。结果表明，在仇恨言论的法律评估中，模型与法律专家之间仍然存在显着的绩效差距，而不管模型的措施水平如何。我们的分析表明，以抽象法律知识为条件的模型缺乏深厚的任务理解，通常与自己矛盾和幻觉答案，而使用具体法律知识的模型在确定相关目标群体方面表现出色，但在分类目标行为方面挣扎。

Title: Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Authors: Aditya Bharat Soni, Boxuan Li, Xingyao Wang, Valerie Chen, Graham Neubig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03011
Pdf URL: https://arxiv.org/pdf/2506.03011
Copy Paste: [[2506.03011]] Coding Agents with Multimodal Browsing are Generalist Problem Solvers(https://arxiv.org/abs/2506.03011)
Keywords: agent
Abstract: Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.
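Summary: Modern human labor is characterized by specialization: we train for years and develop particular tools that let us perform well across a variety of tasks. AI agents have likewise been specialized for domains such as software engineering, web navigation, and workflow automation. However, this yields agents that are good at one thing but fail to generalize beyond their intended scope. One reason is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask: what is the minimal set of general tools that can achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, and multimodal web browsing and file access. Importantly, OpenHands-Versa shows superior or competitive performance relative to leading specialized agents on three diverse and challenging benchmarks (SWE-Bench Multimodal, GAIA, and The Agent Company), outperforming the best previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points, respectively. We further show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.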

Title: Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning

Authors: Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.03035
Pdf URL: https://arxiv.org/pdf/2506.03035
Copy Paste: [[2506.03035]] Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning(https://arxiv.org/abs/2506.03035)
Keywords: language model, llm, prompt
Abstract: Understanding user queries is fundamental in many applications, such as home assistants, booking systems, or recommendations. Accordingly, it is crucial to develop accurate Spoken Language Understanding (SLU) approaches to ensure the reliability of the considered system. Current state-of-the-art SLU techniques rely on large amounts of training data; however, only limited annotated examples are available for specific tasks or languages. In the meantime, instruction-tuned large language models (LLMs) have shown exceptional performance on unseen tasks in a few-shot setting when provided with adequate prompts. In this work, we propose to explore example selection by leveraging information retrieval (IR) approaches to build an enhanced prompt that is applied to an SLU task. We evaluate the effectiveness of the proposed method on several SLU benchmarks. Experimental results show that lexical IR methods significantly enhance performance without increasing prompt length.
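Summary: Understanding user queries is fundamental in many applications, such as home assistants, booking systems, and recommendations. It is therefore crucial to develop accurate Spoken Language Understanding (SLU) approaches to ensure the reliability of the system in question. Current state-of-the-art SLU techniques rely on large amounts of training data, but only limited annotated examples are available for specific tasks or languages. Meanwhile, instruction-tuned large language models (LLMs) have shown exceptional performance on unseen tasks in a few-shot setting when given adequate prompts. In this work, we explore example selection using information retrieval (IR) approaches to build an enhanced prompt that is applied to an SLU task. We evaluate the effectiveness of the proposed method on several SLU benchmarks. Experimental results show that lexical IR methods significantly improve performance without increasing prompt length.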
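Illustration (not from the paper): lexical retrieval for few-shot example selection can be sketched as below. This is a minimal sketch that assumes BM25 as the lexical scorer, a whitespace tokenizer, and an intent-classification prompt layout. The abstract does not say which lexical method, tokenizer, or prompt template the authors use, and the names `bm25_scores`, `select_examples`, `build_prompt`, and `POOL` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the sketch.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_examples(query, pool, k=2):
    """Return the k annotated examples whose utterances best match the query."""
    scores = bm25_scores(query, [utt for utt, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


def build_prompt(query, pool, k=2):
    """Assemble a few-shot prompt; k is fixed, so prompt length does not grow."""
    blocks = [f"Utterance: {utt}\nIntent: {label}"
              for utt, label in select_examples(query, pool, k)]
    blocks.append(f"Utterance: {query}\nIntent:")
    return "\n\n".join(blocks)


# Hypothetical annotated pool of (utterance, intent) pairs.
POOL = [
    ("book a table for two tonight", "book_restaurant"),
    ("play some jazz music", "play_music"),
    ("what is the weather in paris", "get_weather"),
    ("reserve a table at an italian restaurant", "book_restaurant"),
]

if __name__ == "__main__":
    print(build_prompt("book a table for four", POOL))
```

Because the number of retrieved examples stays fixed at `k`, swapping random examples for retrieved ones changes which demonstrations appear in the prompt, not how many.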

Title: Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

Authors: Jintian Shao, Yiming Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03038
Pdf URL: https://arxiv.org/pdf/2506.03038
Copy Paste: [[2506.03038]] Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective(https://arxiv.org/abs/2506.03038)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Reinforcement learning (RL) enhances large language models (LLMs) in complex, long-chain-of-thought (long-CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms like Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long-term value for fine-grained, step-by-step policy guidance in extended reasoning chains. We argue these limitations stem from inherent difficulties in credit assignment, value function representational capacity with temporally abstracted goals, and translating global value signals into local policy improvements, especially with sparse rewards. Our theoretical analysis examines these aspects to illuminate VAPO's boundaries in long-term value modeling, aiming to deepen understanding of current RL for advanced reasoning and suggest future research for more robust LLM agents.
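Summary: Reinforcement learning (RL) enhances large language models (LLMs) in complex, long chain-of-thought (long-CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms such as Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long-term value for fine-grained, step-by-step policy guidance in extended reasoning chains. We argue that these limitations stem from inherent difficulties in credit assignment, in the representational capacity of value functions with temporally abstracted goals, and in translating global value signals into local policy improvements, especially under sparse rewards. Our theoretical analysis examines these aspects to clarify VAPO's boundaries in long-term value modeling, aiming to deepen understanding of current RL for advanced reasoning and to suggest future research toward more robust LLM agents.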

Title: Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs

Authors: Yuval Kansal, Shmuel Berman, Lydia Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03051
Pdf URL: https://arxiv.org/pdf/2506.03051
Copy Paste: [[2506.03051]] Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs(https://arxiv.org/abs/2506.03051)
Keywords: language model, llm
Abstract: Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.
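Summary: Factuality is a necessary precursor to useful educational tools. As adoption of large language models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.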

Title: Literary Evidence Retrieval via Long-Context Language Models

Authors: Katherine Thai, Mohit Iyyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03090
Pdf URL: https://arxiv.org/pdf/2506.03090
Copy Paste: [[2506.03090]] Literary Evidence Retrieval via Long-Context Language Models(https://arxiv.org/abs/2506.03090)
Keywords: language model, llm
Abstract: How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.
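Summary: How well do modern long-context language models understand literary fiction? We explore this question through the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to build a benchmark in which the entire text of a primary source (e.g., The Great Gatsby) is given to an LLM together with literary criticism that is missing a quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model reaches only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open-weight and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, which signals open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.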

Title: Beyond Text Compression: Evaluating Tokenizers Across Scales

Authors: Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03101
Pdf URL: https://arxiv.org/pdf/2506.03101
Copy Paste: [[2506.03101]] Beyond Text Compression: Evaluating Tokenizers Across Scales(https://arxiv.org/abs/2506.03101)
Keywords: language model
Abstract: The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf's law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our work offers a more efficient path to informed tokenizer selection in future language model development.
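Summary: The choice of tokenizer can profoundly affect language model performance, yet accessible and reliable evaluation of tokenizer quality remains an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has a negligible effect on English tasks but leads to consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics, inspired by Zipf's law, that correlate more strongly with downstream performance than text compression does when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluation. Our work offers a more efficient path to informed tokenizer selection in future language model development.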
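Illustration (not from the paper): the contrast between a compression-style metric and a Zipf-style metric can be sketched as below. The abstract does not define the proposed metrics, so this minimal sketch only shows two generic quantities: tokens per character as a compression proxy, and the least-squares slope of log frequency against log rank, which is close to -1 for an ideal Zipfian distribution. The function names are hypothetical.

```python
import math
from collections import Counter


def tokens_per_char(token_seqs, texts):
    """Compression proxy: average number of tokens emitted per input character."""
    total_tokens = sum(len(seq) for seq in token_seqs)
    total_chars = sum(len(text) for text in texts)
    return total_tokens / total_chars


def zipf_slope(token_seqs):
    """Least-squares slope of log(frequency) versus log(rank) over token types.

    Needs at least two distinct token types, otherwise the variance is zero.
    """
    counts = Counter(tok for seq in token_seqs for tok in seq)
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    texts = ["the cat sat on the mat", "the dog sat"]
    # Assumption: a whitespace split stands in for a real tokenizer.
    token_seqs = [text.split() for text in texts]
    print("tokens per char:", tokens_per_char(token_seqs, texts))
    print("zipf slope:", zipf_slope(token_seqs))
```

Both functions take already tokenized sequences, so the same corpus can be scored under several tokenizers and the resulting numbers compared side by side.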

Title: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chao Yang, Helen Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03106
Pdf URL: https://arxiv.org/pdf/2506.03106
Copy Paste: [[2506.03106]] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback(https://arxiv.org/abs/2506.03106)
Keywords: language model, llm
Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.
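Summary: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges that RL faces when it relies solely on numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then show that RL-finetuned models, even after reaching a performance plateau, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO lets LLMs learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments with Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.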

Title: AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation

Authors: Prashanth Vijayaraghavan, Luyao Shi, Ehsan Degan, Vandana Mukherjee, Xin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03122
Pdf URL: https://arxiv.org/pdf/2506.03122
Copy Paste: [[2506.03122]] AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation(https://arxiv.org/abs/2506.03122)
Keywords: language model, llm, prompt
Abstract: Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.
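Summary: Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and the need for strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of large language models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, in which an LLM learns to generate circuit topologies from structured prompts that encode design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that, compared with the best baselines, AUTOCIRCUIT-RL generates about 12% more valid circuits and improves efficiency by about 14%, while reducing duplicate generation rates by about 38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advance in AI-driven circuit design.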

Title: Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Authors: Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03136
Pdf URL: https://arxiv.org/pdf/2506.03136
Copy Paste: [[2506.03136]] Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning(https://arxiv.org/abs/2506.03136)
Keywords: llm, agent
Abstract: We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models.
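Summary: We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and lets the unit tester learn directly from the coder's mistakes. After optimization on Qwen2.5-Instruct models, our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0%, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They extend naturally to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models.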

Title: GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03143
Pdf URL: https://arxiv.org/pdf/2506.03143
Copy Paste: [[2506.03143]] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents(https://arxiv.org/abs/2506.03143)
Keywords: agent
Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
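Summary: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plan. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, an inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models such as Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier that evaluates the proposed candidates and selects the most plausible action region for execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for the 7B model) while keeping the VLM backbone frozen is sufficient to reach performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can give the underlying VLM effective grounding capabilities without compromising its general-purpose strengths.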
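Illustration (not from the paper): the idea of grounding through attention over patch tokens, rather than through generated coordinate text, can be sketched as below. This is a minimal sketch that assumes scaled dot-product scoring between one action-token vector and each patch vector, a row-major patch grid, and patch centers as the proposed click points. The abstract does not specify the real head's architecture, training, or the verifier, so none of those are modeled, and all names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def patch_attention(action_vec, patch_vecs):
    """Attention weights of one action token over all visual patch tokens."""
    scale = math.sqrt(len(action_vec))
    scores = [sum(a * p for a, p in zip(action_vec, patch)) / scale
              for patch in patch_vecs]
    return softmax(scores)


def patch_center(index, grid_width, patch_size):
    """Pixel center (x, y) of a patch in a row-major grid."""
    row, col = divmod(index, grid_width)
    return ((col + 0.5) * patch_size, (row + 0.5) * patch_size)


def propose_regions(weights, grid_width, patch_size, top_k=1):
    """Return the centers of the top_k most attended patches as candidates."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [patch_center(i, grid_width, patch_size) for i in ranked[:top_k]]


if __name__ == "__main__":
    # Toy 2x2 patch grid with 2-dimensional embeddings (made-up numbers).
    action = [1.0, 0.0]
    patches = [[0.0, 1.0], [0.0, 0.5], [3.0, 0.0], [1.0, 0.0]]
    weights = patch_attention(action, patches)
    print(propose_regions(weights, grid_width=2, patch_size=14, top_k=2))
```

Returning several top-ranked patches rather than a single point is what would leave room for a separate verifier step to choose among candidates.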

Title: Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM

Authors: Pralaypati Ta, Sriram Venkatesaperumal, Keerthi Ram, Mohanasankar Sivaprakasam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03145
Pdf URL: https://arxiv.org/pdf/2506.03145
Copy Paste: [[2506.03145]] Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM(https://arxiv.org/abs/2506.03145)
Keywords: language model, llm
Abstract: Neuroscience research publications encompass a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential for advancing the field. However, when knowledge is dispersed across multiple sources, current state-of-the-art retrieval methods often struggle to extract the necessary information. A knowledge graph (KG) can integrate and link knowledge from multiple sources, but existing methods for constructing KGs in neuroscience often rely on labeled data and require domain expertise. Acquiring large-scale, labeled data for a specialized area like neuroscience presents significant challenges. This work proposes novel methods for constructing a KG from an unlabeled large-scale neuroscience research corpus utilizing large language models (LLMs), neuroscience ontology, and text embeddings. We analyze the semantic relevance of neuroscience text segments identified by the LLM for building the knowledge graph. We also introduce an entity-augmented information retrieval algorithm to extract knowledge from the KG. Several experiments were conducted to evaluate the proposed approaches, and the results demonstrate that our methods significantly enhance knowledge discovery from the unlabeled neuroscience research corpus. The approach achieves an F1 score of 0.84 for entity extraction, and the knowledge obtained from the KG improves answers to over 54% of the questions.
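Summary: Neuroscience research publications contain a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential for advancing the field. However, when knowledge is dispersed across multiple sources, current state-of-the-art retrieval methods often struggle to extract the necessary information. A knowledge graph (KG) can integrate and link knowledge from multiple sources, but existing methods for constructing KGs in neuroscience often rely on labeled data and require domain expertise. Acquiring large-scale labeled data for a specialized area such as neuroscience presents significant challenges. This work proposes novel methods for constructing a KG from an unlabeled large-scale neuroscience research corpus using large language models (LLMs), neuroscience ontology, and text embeddings. We analyze the semantic relevance of the neuroscience text segments that the LLM identifies for building the knowledge graph. We also introduce an entity-augmented information retrieval algorithm to extract knowledge from the KG. Several experiments were conducted to evaluate the proposed approaches, and the results show that our methods significantly enhance knowledge discovery from the unlabeled neuroscience research corpus. The approach achieves an F1 score of 0.84 for entity extraction, and the knowledge obtained from the KG improves answers to over 54% of the questions.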

Title: Causal Estimation of Tokenisation Bias

Authors: Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03149
Pdf URL: https://arxiv.org/pdf/2506.03149
Copy Paste: [[2506.03149]] Causal Estimation of Tokenisation Bias(https://arxiv.org/abs/2506.03149)
Keywords: language model
Abstract: Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or excluding a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
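Summary: Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of tokeniser, which maps character-strings to subwords, should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or excluding a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it with a regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. We can therefore estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.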
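Illustration (not from the paper): a sharp regression discontinuity estimate around the vocabulary cutoff $K$ can be sketched as below. This is a minimal sketch that assumes a local linear fit on each side of the cutoff, a fixed bandwidth, and noise-free synthetic outcomes. The paper's actual estimator, bandwidth choice, and outcome variable are not given in the abstract, and the names `fit_line` and `rd_estimate` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var
    return mean_y - slope * mean_x, slope


def rd_estimate(ranks, outcomes, cutoff, bandwidth):
    """Jump in the outcome at the cutoff: in-vocabulary minus out-of-vocabulary.

    Subwords with rank <= cutoff are treated (in the vocabulary). Ranks are
    centered on the cutoff so each fitted intercept is the prediction there.
    """
    inside = [(r - cutoff, y) for r, y in zip(ranks, outcomes)
              if cutoff - bandwidth < r <= cutoff]
    outside = [(r - cutoff, y) for r, y in zip(ranks, outcomes)
               if cutoff < r <= cutoff + bandwidth]
    intercept_in, _ = fit_line(*zip(*inside))
    intercept_out, _ = fit_line(*zip(*outside))
    return intercept_in - intercept_out


if __name__ == "__main__":
    K = 100  # hypothetical vocabulary cutoff
    ranks = list(range(1, 201))
    # Synthetic log-probabilities: a smooth trend in rank plus a jump of 2.0
    # for subwords that made it into the vocabulary (made-up numbers).
    outcomes = [-5.0 - 0.01 * (r - K) + (2.0 if r <= K else 0.0) for r in ranks]
    print(rd_estimate(ranks, outcomes, cutoff=K, bandwidth=20))
```

Restricting both fits to a window around the cutoff is what makes the comparison one between similar subwords: those just inside and just outside the vocabulary differ little in rank, so the remaining gap is attributed to vocabulary membership.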