2025-06-10

Title: How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG

Authors: Qiming Zeng, Xiao Yan, Hao Luo, Yuhao Lin, Yuxiang Wang, Fangcheng Fu, Bo Du, Quanqing Xu, Jiawei Jiang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.06331
Pdf URL: https://arxiv.org/pdf/2506.06331
Copy Paste: [[2506.06331]] How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG(https://arxiv.org/abs/2506.06331)
Keywords: language model, llm, retrieval-augmented generation
Abstract: By retrieving contexts from knowledge graphs, graph-based retrieval-augmented generation (GraphRAG) enhances large language models (LLMs) to generate quality answers for user questions. Many GraphRAG methods have been proposed and reported inspiring performance in answer quality. However, we observe that the current answer evaluation framework for GraphRAG has two critical flaws, i.e., unrelated questions and evaluation biases, which may lead to biased or even wrong conclusions on performance. To tackle the two flaws, we propose an unbiased evaluation framework that uses graph-text-grounded question generation to produce questions that are more related to the underlying dataset and an unbiased evaluation procedure to eliminate the biases in LLM-based answer assessment. We apply our unbiased framework to evaluate 3 representative GraphRAG methods and find that their performance gains are much more moderate than reported previously. Although our evaluation framework may still have flaws, it calls for scientific evaluations to lay solid foundations for GraphRAG research.
摘要：通过从知识图中检索上下文，基于图的检索增强生成（GraphRag）增强了大型语言模型（LLMS），以生成用户问题的质量答案。已经提出了许多GraphRag方法，并报道了在答案质量方面鼓舞人心的性能。但是，我们观察到，GraphRag的当前答案评估框架有两个关键缺陷，即无关的问题和评估偏见，这可能会导致对性能的偏见甚至错误的结论。为了解决这两个缺陷，我们提出了一个无偏见的评估框架，该框架使用图形 - 台面问题生成来产生与基础数据集更相关的问题，以及无偏见的评估程序，以消除基于LLM的答案评估中的偏见。我们将公正的框架应用于评估3种代表性的GraphRag方法，并发现其性能提高比以前报道的要高得多。尽管我们的评估框架可能仍然存在缺陷，但它要求科学评估为GraphRag研究奠定坚实的基础。

Title: TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment

Authors: Taesoo Kim, Jong Hwan Ko
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.06343
Pdf URL: https://arxiv.org/pdf/2506.06343
Copy Paste: [[2506.06343]] TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment(https://arxiv.org/abs/2506.06343)
Keywords: language model, llm
Abstract: Recent advances in speech-enabled language models have shown promising results in building intelligent voice assistants. However, most existing approaches rely on large-scale paired speech-text data and extensive computational resources, which pose challenges in terms of scalability and accessibility. In this paper, we present \textbf{TESU-LLM}, a novel framework that enables training speech-capable language models using only text data. Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. By aligning the encoder output with the embedding space of a LLM via a lightweight projection network, we enable the model to generalize from text-only supervision to speech-based inference. Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks, comparable to baseline methods trained with large-scale multimodal datasets and substantial computational resources. These results highlight the effectiveness and efficiency of our approach, offering a scalable path toward building speech LLMs without speech data.
摘要：支持语音的语言模型的最新进展显示了建立智能语音助手的有希望的结果。但是，大多数现有的方法都依赖于大规模的配对语音文本数据和广泛的计算资源，这在可伸缩性和可访问性方面构成了挑战。在本文中，我们提出了\ textbf {tesu-llm}，这是一个新颖的框架，可仅使用文本数据培训培训语言能力的语言模型。我们的关键见解是利用统一的编码器，该编码器将语义上等效的文本和语音输入映射到共享的潜在空间。通过将编码器输出与LLM的嵌入空间对齐，通过轻型投影网络，我们使该模型可以从仅文本监督到基于语音的推论。尽管受过文本的培训，但Tesu-Llm在各种语音相关的基准测试中取得了强劲的性能，可与经过大规模多模式数据集和实质性计算资源训练的基线方法相媲美。这些结果突出了我们方法的有效性和效率，为构建语音LLM的可扩展路径而没有语音数据。

Title: Unified Game Moderation: Soft-Prompting and LLM-Assisted Label Transfer for Resource-Efficient Toxicity Detection

Authors: Zachary Yang, Domenico Tullo, Reihaneh Rabbany
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06347
Pdf URL: https://arxiv.org/pdf/2506.06347
Copy Paste: [[2506.06347]] Unified Game Moderation: Soft-Prompting and LLM-Assisted Label Transfer for Resource-Efficient Toxicity Detection(https://arxiv.org/abs/2506.06347)
Keywords: gpt, llm, prompt, chat
Abstract: Toxicity detection in gaming communities faces significant scaling challenges when expanding across multiple games and languages, particularly in real-time environments where computational efficiency is crucial. We present two key findings to address these challenges while building upon our previous work on ToxBuster, a BERT-based real-time toxicity detection system. First, we introduce a soft-prompting approach that enables a single model to effectively handle multiple games by incorporating game-context tokens, matching the performance of more complex methods like curriculum learning while offering superior scalability. Second, we develop an LLM-assisted label transfer framework using GPT-4o-mini to extend support to seven additional languages. Evaluations on real game chat data across French, German, Portuguese, and Russian achieve macro F1-scores ranging from 32.96% to 58.88%, with particularly strong performance in German, surpassing the English benchmark of 45.39%. In production, this unified approach significantly reduces computational resources and maintenance overhead compared to maintaining separate models for each game and language combination. At Ubisoft, this model successfully identifies an average of 50 players, per game, per day engaging in sanctionable behavior.
摘要：在跨多种游戏和语言扩展时，游戏社区中的毒性检测会面临巨大的扩展挑战，尤其是在计算效率至关重要的实时环境中。我们提出了两个关键发现，以应对这些挑战，同时我们先前在基于BERT的实时毒性检测系统Toxbuster上的工作基础上。首先，我们引入了一种软宣传方法，该方法使单个模型通过合并游戏版本令牌可以有效地处理多个游戏，从而匹配更复杂的方法，例如课程学习，同时提供出色的可扩展性。其次，我们使用GPT-4O-Mini开发了一个由LLM辅助的标签传输框架，以将支持扩展到另外七种语言。对法国，德语，葡萄牙语和俄罗斯的真实游戏聊天数据的评估范围为32.96％至58.88％，在德语中的表现尤其强劲，超过了45.39％的英语基准。在生产中，与维护每个游戏和语言组合的单独模型相比，这种统一方法可大大减少计算资源和维护间接费用。在Ubisoft，该模型平均每天都能确定50名球员，每天从事可严格的行为。

Title: Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models

Authors: Panagiotis Koletsis, Christos Panagiotopoulos, Georgios Th. Papadopoulos, Vasilis Efthymiou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06371
Pdf URL: https://arxiv.org/pdf/2506.06371
Copy Paste: [[2506.06371]] Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models(https://arxiv.org/abs/2506.06371)
Keywords: language model, llm, prompt
Abstract: Over the past few years, table interpretation tasks have made significant progress due to their importance and the introduction of new technologies and benchmarks in the field. This work experiments with a hybrid approach for detecting relationships among columns of unlabeled tabular data, using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The main modules of this approach for reducing the search space are domain and range constraints detection, as well as relation co-appearance analysis. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs at various levels of quantization. The experiments were performed, as well as at different prompting techniques. The proposed methodology, which is publicly available on github, proved to be competitive with state-of-the-art approaches on these datasets.
摘要：在过去的几年中，由于其重要性以及该领域的新技术和基准的引入，桌面解释任务取得了重大进展。这项使用混合方法的工作实验，用于检测未标记表格数据列之间的关系，使用知识图（kg）作为参考点，一种称为CPA的任务。这种方法利用大型语言模型（LLM），同时采用统计分析来减少潜在的KG关系的搜索空间。该方法减少搜索空间的主要模块是域和范围约束检测，以及关系共同出现分析。 SEMTAB挑战提供的两个基准数据集的实验评估评估了每个模块的影响以及不同量化级别的不同最先进的LLM的有效性。进行了实验，以及在不同的提示技术下进行的。拟议的方法在GitHub上公开可用，被证明具有这些数据集中最先进的方法具有竞争力。

Title: Enhancing Decision-Making of Large Language Models via Actor-Critic

Authors: Heng Dong, Kefei Duan, Chongjie Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06376
Pdf URL: https://arxiv.org/pdf/2506.06376
Copy Paste: [[2506.06376]] Enhancing Decision-Making of Large Language Models via Actor-Critic(https://arxiv.org/abs/2506.06376)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. Experiments across diverse environments -- including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) -- demonstrate the framework's generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs' intrinsic knowledge to advance decision-making capabilities in multi-step environments.
摘要：大型语言模型（LLM）在自然语言处理任务中取得了显着进步，但是它们在需要长期推理和与高级目标的复杂决策场景中遇到了挑战。现有的方法依赖于短期自动回归动作的产生，或者在准确模拟推出和评估结果中面临限制，从而导致次优决策。本文介绍了一种基于LLM的新型参与者批评框架，称为LAC，该框架以原则可扩展的方式有效地通过长期行动评估来改善LLM策略。我们的方法解决了两个关键挑战：（1）通过与正/负结果相关的令牌逻辑计算Q值来提取鲁棒的行动评估，并通过未来的轨迹推广和推理来增强；（2）通过无梯度机制实现有效的政策改进。跨不同环境的实验 - 包括高级决策（ALFWORLD），低水平的动作空间（Babyai-Text）和大型动作空间（WebShop） - 演示了该框架比最先进方法的一般性和优势。值得注意的是，我们的方法使用7B/8B参数LLM实现了竞争性能，即使在复杂任务中使用GPT-4的基线方法也优于基线方法。这些结果强调了将结构化策略优化与LLMS的内在知识相结合的潜力，以提高多步环境中的决策能力。

Title: Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering

Authors: Yi Ji, Runzhi Li, Baolei Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06384
Pdf URL: https://arxiv.org/pdf/2506.06384
Copy Paste: [[2506.06384]] Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering(https://arxiv.org/abs/2506.06384)
Keywords: language model, gpt, llm, prompt
Abstract: With the widespread adoption of Large Language Models (LLMs), prompt injection attacks have emerged as a significant security threat. Existing defense mechanisms often face critical trade-offs between effectiveness and generalizability. This highlights the urgent need for efficient prompt injection detection methods that are applicable across a wide range of LLMs. To address this challenge, we propose DMPI-PMHFE, a dual-channel feature fusion detection framework. It integrates a pretrained language model with heuristic feature engineering to detect prompt injection attacks. Specifically, the framework employs DeBERTa-v3-base as a feature extractor to transform input text into semantic vectors enriched with contextual information. In parallel, we design heuristic rules based on known attack patterns to extract explicit structural features commonly observed in attacks. Features from both channels are subsequently fused and passed through a fully connected neural network to produce the final prediction. This dual-channel approach mitigates the limitations of relying only on DeBERTa to extract features. Experimental results on diverse benchmark datasets demonstrate that DMPI-PMHFE outperforms existing methods in terms of accuracy, recall, and F1-score. Furthermore, when deployed actually, it significantly reduces attack success rates across mainstream LLMs, including GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.
摘要：随着大型语言模型（LLM）的广泛采用，迅速注射攻击已成为重大的安全威胁。现有的防御机制通常面临着有效性和普遍性之间的关键权衡。这凸显了迫切需要在广泛的LLMS中适用的有效的快速注射检测方法。为了应对这一挑战，我们提出了DMPI-PMHFE，这是一个双通道功能融合检测框架。它将验证的语言模型与启发式功能工程集成在一起，以检测及时注射攻击。具体而言，该框架采用Deberta-V3基键作为特征提取器来将输入文本转换为具有上下文信息的语义向量。同时，我们根据已知的攻击模式设计启发式规则，以提取攻击中常见的明确结构特征。随后将两个通道的功能融合并通过完全连接的神经网络以产生最终预测。这种双通道方法减轻了仅依靠Deberta提取特征的局限性。各种基准数据集的实验结果表明，DMPI-PMHFE在准确性，召回和F1得分方面优于现有方法。此外，当实际部署时，它会大大降低主流LLM的攻击成功率，包括GLM-4，Llama 3，QWEN 2.5和GPT-4O。

Title: Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Authors: Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06395
Pdf URL: https://arxiv.org/pdf/2506.06395
Copy Paste: [[2506.06395]] Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models(https://arxiv.org/abs/2506.06395)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as reward signals-eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 8 samples per question and 4 training epochs, RLSC improves accuracy by +20.10% on AIME2024, +49.40% on MATH500, and +52.50% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models with minimal supervision.
摘要：大型语言模型（LLM）在推理方面表现出色，但是训练后对于使其行为与任务目标保持一致至关重要。现有的强化学习（RL）方法通常取决于昂贵的人类注释或外部奖励模型。我们建议通过自信（RLSC）进行加强学习，该学习将模型自身的信心用作奖励信号 - 阐明对标签，偏好模型或奖励工程的需求。 RLSC应用于QWEN2.5-MATH-7B，每个问题只有8个样本和4个训练时期，可在AIME2024上提高准确性 +20.10％，在MATH500上提高了 +49.40％，AMC23的精度提高了 +49.40％。 RLSC提供了一种简单，可扩展的培训方法，用于最少的监督推理模型。

Title: Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things

Authors: Christopher D. Molek, Roberto Fronteddu, K. Brent Venable, Niranjan Suri
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2506.06396
Pdf URL: https://arxiv.org/pdf/2506.06396
Copy Paste: [[2506.06396]] Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things(https://arxiv.org/abs/2506.06396)
Keywords: language model, llm
Abstract: The expansion of the Internet of Things (IoT) in the battlefield, Internet of Battlefield Things (IoBT), gives rise to new opportunities for enhancing situational awareness. To increase the potential of IoBT for situational awareness in critical decision making, the data from these devices must be processed into consumer-ready information objects, and made available to consumers on demand. To address this challenge we propose a workflow that makes use of natural language processing (NLP) to query a database technology and return a response in natural language. Our solution utilizes Large Language Models (LLMs) that are sized for edge devices to perform NLP as well as graphical databases which are well suited for dynamic connected networks which are pervasive in the IoBT. Our architecture employs LLMs for both mapping questions in natural language to Cypher database queries as well as to summarize the database output back to the user in natural language. We evaluate several medium sized LLMs for both of these tasks on a database representing publicly available data from the US Army's Multipurpose Sensing Area (MSA) at the Jornada Range in Las Cruces, NM. We observe that Llama 3.1 (8 billion parameters) outperforms the other models across all the considered metrics. Most importantly, we note that, unlike current methods, our two step approach allows the relaxation of the Exact Match (EM) requirement of the produced Cypher queries with ground truth code and, in this way, it achieves a 19.4% increase in accuracy. Our workflow lays the ground work for deploying LLMs on edge devices to enable natural language interactions with databases containing information objects for critical decision making.
摘要：战场上事物（IOBT）在战场上的物联网（IoT）的扩展为增强情境意识带来了新的机会。为了提高IOBT在关键决策中的情境意识的潜力，必须将这些设备的数据处理成消费者就绪的信息对象，并根据需要提供给消费者。为了应对这一挑战，我们提出了一个使用自然语言处理（NLP）来查询数据库技术并返回自然语言的响应的工作流程。我们的解决方案利用了大小的大型语言模型（LLM）来执行NLP以及图形数据库，这些数据库非常适合IOBT中普遍存在的动态连接网络。我们的体系结构采用LLM来绘制自然语言的映射问题，以对Cypher数据库查询，并以自然语言将数据库输出总结回用户。我们在数据库中评估了这两个任务的几个中型LLM，该数据库代表了美国陆军多功能传感区域（MSA）的公开数据，位于新墨西哥州拉斯克鲁塞斯的Jornada范围。我们观察到，美洲驼3.1（80亿个参数）在所有考虑的指标上都优于其他模型。最重要的是，我们注意到，与当前的方法不同，我们的两个步骤方法可以放松使用地面真相代码生成的Cypher查询的确切匹配（EM）要求，并且通过这种方式，它的准确性提高了19.4％。我们的工作流程为在边缘设备上部署LLM的地面工作奠定了基础，以使自然语言与包含信息对象的数据库进行自然语言互动以进行关键决策。

Title: Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs

Authors: Hongming Yang, Shi Lin, Jun Shao, Changting Lin, Donghai Zhu, Meng Han, Qinglei Kong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06401
Pdf URL: https://arxiv.org/pdf/2506.06401
Copy Paste: [[2506.06401]] Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs(https://arxiv.org/abs/2506.06401)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Lightweight Large Language Models (LwLLMs) are reduced-parameter, optimized models designed to run efficiently on consumer-grade hardware, offering significant advantages in resource efficiency, cost-effectiveness, and data privacy. However, these models often struggle with limited inference and reasoning capabilities, which restrict their performance on complex tasks and limit their practical applicability. Moreover, existing prompt optimization methods typically rely on extensive manual effort or the meta-cognitive abilities of state-of-the-art LLMs, making them less effective for LwLLMs. To address these challenges, we introduce DeBoP, a new Direct Behavior Optimization Paradigm, original from the Chain-of-Thought (CoT) prompting technique. Unlike CoT Prompting, DeBoP is an automatic optimization method, which focuses on the optimization directly on the behavior of LwLLMs. In particular, DeBoP transforms the optimization of complex prompts into the optimization of discrete, quantifiable execution sequences using a gradient-free Monte Carlo Tree Search. We evaluate DeBoP on seven challenging tasks where state-of-the-art LLMs excel but LwLLMs generally underperform. Experimental results demonstrate that DeBoP significantly outperforms recent prompt optimization methods on most tasks. In particular, DeBoP-optimized LwLLMs surpass GPT-3.5 on most tasks while reducing computational time by approximately 60% compared to other automatic prompt optimization methods.
摘要：轻巧的大语言模型（LWLLM）是减少参数，优化的模型，旨在有效地在消费级硬件上运行，在资源效率，成本效益和数据隐私方面具有显着优势。但是，这些模型通常会在推理和推理能力有限的情况下挣扎，从而限制了其在复杂任务上的性能并限制其实际适用性。此外，现有的迅速优化方法通常依赖于大量的手动努力或最先进的LLMS的元认知能力，从而使其对LWLLM的有效效果降低。为了应对这些挑战，我们介绍了DeBop，这是一种新的直接行为优化范式，原始的是Thebough（COT）提示技术。与COT提示不同，Debop是一种自动优化方法，它直接关注LWLLMS行为的优化。特别是，Debop使用无梯度的蒙特卡洛树搜索将复杂提示的优化转换为对离散的，可量化的执行序列的优化。我们评估了DEBOP，其中最新的LLM脱颖而出，但LWLLM通常表现不佳。实验结果表明，DEBOP在大多数任务上明显胜过最新的及时优化方法。特别是，与其他自动及时及时优化方法相比，大多数任务的Debop优化LWLLMS在大多数任务上超过GPT-3.5，同时将计算时间降低了约60％。

Title: Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights

Authors: Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06404
Pdf URL: https://arxiv.org/pdf/2506.06404
Copy Paste: [[2506.06404]] Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights(https://arxiv.org/abs/2506.06404)
Keywords: language model, llm
Abstract: The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the "black box" of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.
摘要：大语言模型（LLM）的应用范围继续扩展，从而增加了对与人类价值观保持一致的个性化LLM的兴趣。但是，将这些模型与个人值保持一致会引起重大的安全问题，因为某些值可能与有害信息相关。在本文中，我们确定了与价值一致的LLM相关的特定安全风险，并研究了这些挑战背后的心理原则。我们的发现揭示了两个关键见解。（1）与非最新调整模型相比，价值一致的LLM比其他微调模型更容易出现有害行为，并且在传统安全评估中的风险略高。（2）这些安全问题之所以出现，是因为价值与价值的LLM真正地根据一致性值生成文本，这可以扩大有害结果。使用具有详细安全类别的数据集，我们发现价值一致性和安全风险之间的显着相关性，并得到心理假设的支持。这项研究提供了对价值一致性“黑匣子”的见解，并提出了在上下文对准方法中提高价值与价值LLM的安全性。

Title: SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities

Authors: Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Chen Wei, Fangxiang Feng, Xiaojie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06406
Pdf URL: https://arxiv.org/pdf/2506.06406
Copy Paste: [[2506.06406]] SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities(https://arxiv.org/abs/2506.06406)
Keywords: language model
Abstract: Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods to build multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft ModalityAware Routing (SMAR), a novel regularization technique that uses Kullback Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying model architecture or heavily relying on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution to balance modality differentiation and language capabilities in multimodal MoE models.
摘要：专家（MOE）架构的混合已成为扩展大语言模型的关键方法，并越来越兴趣将其扩展到多模式任务。建立多模式MOE模型的现有方法要么会产生高训练成本，要么在调整预估计的模型时会遭受语言功能的退化。为了解决这个问题，我们提出了软模态路由（SMAR），这是一种新颖的正则化技术，它使用Kullback Leibler Divergence来控制跨模态的路由概率分布，鼓励专家专业化而不修改模型体系结构或严重依赖文本数据。视觉教学调整的实验表明，SMAR只有2.5％的纯文本保留语言能力为86.6％，在保持强大的多峰性能的同时，表现优于基线。我们的方法提供了一种实用有效的解决方案，以平衡多模式MOE模型中的模态差异化和语言能力。

Title: Canonical Autoregressive Generation

Authors: Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.06446
Pdf URL: https://arxiv.org/pdf/2506.06446
Copy Paste: [[2506.06446]] Canonical Autoregressive Generation(https://arxiv.org/abs/2506.06446)
Keywords: language model
Abstract: State of the art large language models are trained using large amounts of tokens derived from raw text using what is called a tokenizer. Crucially, the tokenizer determines the (token) vocabulary a model will use during inference as well as, in principle, the (token) language. This is because, while the token vocabulary may allow for different tokenizations of a string, the tokenizer always maps the string to only one of these tokenizations--the canonical tokenization. However, multiple lines of empirical evidence suggest that large language models do not always generate canonical token sequences, and this comes with several negative consequences. In this work, we first show that, to generate a canonical token sequence, a model needs to generate (partial) canonical token sequences at each step of the autoregressive generation process underpinning its functioning. Building upon this theoretical result, we introduce canonical sampling, a simple and efficient sampling method that precludes a given model from generating non-canonical token sequences. Further, we also show that, in comparison with standard sampling, the distribution of token sequences generated using canonical sampling is provably closer to the true distribution of token sequences used during training.
摘要：最先进的语言模型是使用使用称为令牌的大量代币来训练的。至关重要的是，令牌仪决定了模型在推理期间以及（代币）语言中使用的（令牌）词汇。这是因为，虽然令牌词汇可以允许字符串的不同令牌，但令牌始终将字符串映射到这些令牌之一，即规范的标记化。但是，多种经验证据表明，大型语言模型并不总是会产生规范令牌序列，这会带来几种负面后果。在这项工作中，我们首先表明，要生成一个规范的令牌序列，模型需要在自动回归生成过程的每个步骤中生成（部分）规范令牌序列，以支撑其功能。在理论结果的基础上，我们引入了规范采样，这是一种简单有效的抽样方法，它阻止给定模型产生非典型的令牌序列。此外，我们还表明，与标准采样相比，使用规范采样生成的令牌序列的分布更接近训练过程中使用的令牌序列的真实分布。

Title: What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models

Authors: Kaiser Sun, Fan Bai, Mark Dredze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06485
Pdf URL: https://arxiv.org/pdf/2506.06485
Copy Paste: [[2506.06485]] What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models(https://arxiv.org/abs/2506.06485)
Keywords: language model, llm
Abstract: Large language models frequently rely on both contextual input and parametric knowledge to perform tasks. However, these sources can come into conflict, especially when retrieved documents contradict the model's parametric knowledge. We propose a diagnostic framework to systematically evaluate LLM behavior under context-memory conflict, where the contextual information diverges from their parametric beliefs. We construct diagnostic data that elicit these conflicts and analyze model performance across multiple task types. Our findings reveal that (1) knowledge conflict has minimal impact on tasks that do not require knowledge utilization, (2) model performance is consistently higher when contextual and parametric knowledge are aligned, (3) models are unable to fully suppress their internal knowledge even when instructed, and (4) providing rationales that explain the conflict increases reliance on contexts. These insights raise concerns about the validity of model-based evaluation and underscore the need to account for knowledge conflict in the deployment of LLMs.
摘要：大型语言模型通常依靠上下文输入和参数知识来执行任务。但是，这些来源可能发生冲突，尤其是在检索文档与模型的参数知识相矛盾时。我们提出了一个诊断框架，以系统地评估上下文记忆冲突下的LLM行为，在上下文信息与其参数信念不同。我们构建诊断数据，以引起这些冲突并分析多种任务类型的模型性能。我们的发现表明，（1）知识冲突对不需要知识利用的任务的影响最小，（2）当对上下文和参数知识保持一致时，模型绩效始终较高，（3）模型即使在说明时也无法完全抑制其内部知识，并且（4）提供了解释冲突对上下文的依赖的合理性。这些见解引起了人们对基于模型的评估有效性的担忧，并强调了LLM部署中知识冲突的必要性。

Title: Improving LLM-Powered EDA Assistants with RAFT

Authors: Luyao Shi, Michael Kazda, Charles Schmitter, Hemlata Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06500
Pdf URL: https://arxiv.org/pdf/2506.06500
Copy Paste: [[2506.06500]] Improving LLM-Powered EDA Assistants with RAFT(https://arxiv.org/abs/2506.06500)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Electronic design engineers often struggle to efficiently access relevant information for tasks like design verification and technology development. While large language models (LLMs) can enhance productivity as conversational agents, pre-trained open-source LLMs lack domain-specific knowledge for Electronic Design Automation (EDA). In a Retrieval-Augmented Generation (RAG) context, LLMs rely on external context but may still produce inaccurate responses. Retrieval-Augmented Fine-Tuning (RAFT) improves LLM performance, but acquiring labeled question/answer (Q/A) data in EDA is difficult. To address this, we propose using synthetic Q/A datasets to enhance LLMs with RAFT. Our results show that RAFT with synthetic data significantly boosts LLM performance for RAG-based EDA tasks. We also investigate the impact of using real user questions as Retrieval-Augmented Few-Shot (RAFS) examples for synthetic data generation. Additionally, we implement secure access control to ensure sensitive information is only accessible to authorized personnel. Finally, we assess the risk of data leakage and unintended memorization during fine-tuning with synthetic data, providing practical insights.
摘要：电子设计工程师通常很难有效地访问相关信息，以完成设计验证和技术开发等任务。尽管大型语言模型（LLM）可以提高作为对话剂的生产率，但预培训的开源LLM缺乏针对电子设计自动化（EDA）的特定领域知识。在检索型生成（RAG）上下文中，LLMS依靠外部上下文，但仍可能产生不准确的响应。检索调查的微调（RAFT）提高了LLM的性能，但是在EDA中获取标记的问题/答案（Q/A）数据很困难。为了解决这个问题，我们建议使用合成Q/A数据集用木筏增强LLM。我们的结果表明，使用合成数据的筏显着提高了基于抹布的EDA任务的LLM性能。我们还调查了将真实用户问题作为合成数据生成的检索数量（RAFS）示例的影响。此外，我们实施安全的访问控制，以确保授权人员只能访问敏感信息。最后，我们评估了与合成数据进行微调过程中数据泄漏和意外记忆的风险，从而提供了实用的见解。

Title: Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes

Authors: Kshitish Ghate, Tessa Charlesworth, Mona Diab, Aylin Caliskan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06506
Pdf URL: https://arxiv.org/pdf/2506.06506
Copy Paste: [[2506.06506]] Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes(https://arxiv.org/abs/2506.06506)
Keywords: language model
Abstract: To build fair AI systems we need to understand how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in biases in downstream tasks. In this study, we demonstrate that intrinsic biases in VLM representations systematically ``carry over'' or propagate into zero-shot retrieval tasks, revealing how deeply rooted biases shape a model's outputs. We introduce a controlled framework to measure this propagation by correlating (a) intrinsic measures of bias in the representational space with (b) extrinsic measures of bias in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval. Results show substantial correlations between intrinsic and extrinsic bias, with an average $\rho$ = 0.83 $\pm$ 0.10. This pattern is consistent across 114 analyses, both retrieval directions, six social groups, and three distinct VLMs. Notably, we find that larger/better-performing models exhibit greater bias propagation, a finding that raises concerns given the trend towards increasingly complex AI models. Our framework introduces baseline evaluation tasks to measure the propagation of group and valence signals. Investigations reveal that underrepresented groups experience less robust propagation, further skewing their model-related outcomes.
摘要：为了构建公平的AI系统，我们需要了解社会群体对基础编码器的视觉模型（VLM）的内在偏见如何在下游任务的偏见中体现出来。在这项研究中，我们证明了VLM表示中的内在偏见有系统地``''''或传播到零射击的检索任务中，从而揭示了根源深层的偏见如何影响模型的输出。我们引入了一个受控的框架，通过将（a）在代表空间中的内在偏差与（b）零摄像的偏置零摄像的偏置度量相关联（a）零摄像的文本对图像（TTI）和图像对文本（ITT）检索中的外部偏置度量。结果表明，内在偏见和外在偏见之间存在很大的相关性，平均$ \ rho $ = 0.83 $ \ pm $ 0.10。在114个分析，检索方向，六个社会群体和三个不同的VLM中，这种模式是一致的。值得注意的是，我们发现较大/绩效的模型表现出更大的偏见传播，这一发现引起了人们对日益复杂的AI模型的趋势的关注。我们的框架引入了基线评估任务，以衡量群体和价信号的传播。调查表明，代表性不足的群体的传播较低，进一步扭曲了与模型相关的结果。

Title: Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Syed Zawad, Farhan Ahmed, Heiko Ludwig, Holger Boche
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06522
Pdf URL: https://arxiv.org/pdf/2506.06522
Copy Paste: [[2506.06522]] Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance(https://arxiv.org/abs/2506.06522)
Keywords: language model, llm
Abstract: Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
摘要：关于大型语言模型（LLM）的最新工作越来越重点是培训后和与策划的数据集进行对齐，以增强以下教学，世界知识和专业技能。但是，用于领导开放和封闭源LLM的大多数用于公众开放和封闭源LLM的培训数据集仍然无法访问其施工过程。这种缺乏透明度的促进了近期开源后培训后语料库的发展。尽管对这些开放式替代方案进行培训可以产生与领先模型相当的性能，但由于严格的计算成本在大规模进行大规模进行，因此系统的比较仍然具有挑战性，因此在很大程度上不存在。结果，目前尚不清楚特定的样本，任务类型或策略在评估数据质量时如何影响下游性能。在这项工作中，我们对两个突出的开放式培训数据集进行了首次全面的并排分析：Tulu-3-Sft-Mix和Smoltalk。使用Magpie框架，我们用详细的质量指标来注释每个样本，包括转弯结构（单转弯与多转弯），任务类别，输入质量和响应质量，我们得出了统计数据，这些统计数据揭示了两个数据集之间的结构和定性相似性和差异。基于这些见解，我们设计了一种有原则的策划食谱，该配方会产生一种新的数据混合物Tulutalk，该食谱在匹配或超过其在关键基准测试上的性能时包含的样本比任何一个源数据集少了14％。我们的发现提供了可行的见解，用于构建更有效的培训后数据集，以在实际资源限制内改善模型性能。为了支持未来的研究，我们公开发布了带注释的源数据集和我们精选的Tulutalk混合物。

Title: Beyond Facts: Evaluating Intent Hallucination in Large Language Models

Authors: Yijie Hao, Haofei Yu, Jiaxuan You
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06539
Pdf URL: https://arxiv.org/pdf/2506.06539
Copy Paste: [[2506.06539]] Beyond Facts: Evaluating Intent Hallucination in Large Language Models(https://arxiv.org/abs/2506.06539)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: When exposed to complex queries containing multiple conditions, today's large language models (LLMs) tend to produce responses that only partially satisfy the query while neglecting certain conditions. We therefore introduce the concept of Intent Hallucination. In this phenomenon, LLMs either omit (neglecting to address certain parts) or misinterpret (responding to invented query parts) elements of the given query, leading to intent hallucinated generation. To systematically evaluate intent hallucination, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 20,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. FAITHQA is the first hallucination benchmark that goes beyond factual verification, tailored to identify the fundamental cause of intent hallucination. By evaluating various LLMs on FAITHQA, we find that (1) intent hallucination is a common issue even for state-of-the-art models, and (2) the phenomenon stems from omission or misinterpretation of LLMs. To facilitate future research, we introduce an automatic LLM generation evaluation metric, CONSTRAINT SCORE, for detecting intent hallucination. Human evaluation results demonstrate that CONSTRAINT SCORE is closer to human performance for intent hallucination compared to baselines.
摘要：当暴露于包含多种条件的复杂查询时，当今的大语言模型（LLM）倾向于产生仅在忽略某些条件的同时部分满足查询的响应。因此，我们介绍了意图幻觉的概念。在这种现象中，llms要么省略（忽略解决某些部分）或误解（响应给定查询的发明的查询部分）元素，从而导致意图幻觉。为了系统地评估意图幻觉，我们介绍了FaithQa，这是一种新颖的意图幻觉基准，其中包含20,068个问题，涵盖了具有不同主题和难度的仅查询和检索增强的一代（RAG）设置（RAG）设置。 FaithQA是第一个超越事实验证的幻觉基准，该基准量身定制，旨在确定意图幻觉的基本原因。通过评估FaithQA上的各种LLM，我们发现（1）意图幻觉即使对于最先进的模型也是一个普遍的问题，以及（2）现象是由于遗漏或误解了LLM。为了促进未来的研究，我们引入了自动LLM生成评估度量，约束评分，以检测意图幻觉。人类评估结果表明，与基线相比，与意图幻觉相比，约束得分更接近人类的绩效。

Title: LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

Authors: Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06561
Pdf URL: https://arxiv.org/pdf/2506.06561
Copy Paste: [[2506.06561]] LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles(https://arxiv.org/abs/2506.06561)
Keywords: language model, llm
Abstract: Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
摘要：图字幕对于帮助读者理解和记住图的关键信息至关重要。已经开发了许多模型来生成这些字幕，帮助作者更轻松地制作更好的质量标题。但是，作者几乎总是需要修改通用AI生成的字幕，以匹配其写作风格和域名的风格，从而强调了对个性化的需求。尽管语言模型的个性化（LAMP）进展，但这些技术通常专注于仅文本设置，并且很少解决输入和配置文件都是多模式的方案。本文介绍了灯罩，这是一个具有多模式图谱的个性化图形字幕的数据集。对于每个目标图，Lamp-CAP不仅提供所需的输入，例如图像图像，还提供了来自同一文档的其他三个图形（及其图像，标题和图形段落），以表征上下文。使用四个LLM的实验表明，使用配置信息信息一致地有助于使字幕更接近原始作者编写的字幕。消融研究表明，配置文件中的图像比图形段落更有用，这突出了使用多模式轮廓而不是仅文本图像的优势。

Title: Precise Information Control in Long-Form Text Generation

Authors: Jacqueline He, Howard Yen, Margaret Li, Shuyue Stella Li, Zhiyuan Zeng, Weijia Shi, Yulia Tsvetkov, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06589
Pdf URL: https://arxiv.org/pdf/2506.06589
Copy Paste: [[2506.06589]] Precise Information Control in Long-Form Text Generation(https://arxiv.org/abs/2506.06589)
Keywords: language model, hallucination
Abstract: A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability--improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
摘要：现代语言模型（LMS）中的一个核心挑战是固有的幻觉：相对于输入上下文的合理但未经证实的信息的产生。为了研究此问题，我们提出了一种精确的信息控制（PIC），这是一种新的任务公式，需要模型来生成基于提供的简短的自包式陈述，即可验证的索赔，而无需添加任何不支持的陈述。为了获得综合性，PIC包括一个完整的设置，该设置可以测试模型的精确包含所有输入索赔的能力，以及一个需要模型仅选择性合并相关索赔的部分设置。我们提出了Pic-Bench，这是八个长形成任务的基准（例如，摘要，传记生成），适用于PIC设置，其中LMS提供了良好的，可验证的输入声明。我们对PIC板台上一系列开放和专有LMS的评估表明，令人惊讶的是，最先进的LMS仍在本质上幻觉，超过70％。为了减轻这种缺乏忠诚，我们使用弱监督的偏好数据构建方法介绍了一个训练后框架，以训练具有更强的PIC能力的8B PIC-LM - 在完整的PIC设置中从69.1％到91.0％F1。当整合到端到端的事实生成管道中时，PIC-LM在模棱两可的质量上提高了17.1％的回忆，并在出生场所验证任务中提高了30.5％的效果，从而强调了精确扎根的生成的潜力。

Title: MedCite: Can Language Models Generate Verifiable Text for Medicine?

Authors: Xiao Wang, Mengjue Tan, Qiao Jin, Guangzhi Xiong, Yu Hu, Aidong Zhang, Zhiyong Lu, Minjia Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06605
Pdf URL: https://arxiv.org/pdf/2506.06605
Copy Paste: [[2506.06605]] MedCite: Can Language Models Generate Verifiable Text for Medicine?(https://arxiv.org/abs/2506.06605)
Keywords: language model, llm
Abstract: Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce \name, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that evaluation results correlate well with annotation results from professional experts.
摘要：现有的基于LLM的医疗问答系统缺乏引文的产生和评估能力，从而引起了对其在实践中采用的担忧。在这项工作中，我们介绍了\ Name，这是第一个端到端的框架，可促进使用LLMS用于医疗任务的引文生成的设计和评估。同时，我们介绍了一种新型的多通录录引文方法，该方法产生了高质量的引用。我们的评估强调了对医疗任务的引文生成的挑战和机会，同时确定了对最终引文质量产生重大影响的重要设计选择。与强大的基线方法相比，我们提出的方法可实现出色的引文精度和召回改进，我们表明评估结果与专业专家的注释结果良好相关。

Title: Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Authors: Charles Goddard, Fernando Fernandes Neto
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06607
Pdf URL: https://arxiv.org/pdf/2506.06607
Copy Paste: [[2506.06607]] Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit(https://arxiv.org/abs/2506.06607)
Keywords: language model, llm
Abstract: We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks--Llama$\to$Mistral NeMo (12B) and Qwen$\to$Llama (1B)--we show that OMP achieves best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
摘要：我们提出了一种无训练的方法，通过通过正交匹配的追踪（OMP）重建未看到的令牌嵌入，以预处理的大语言模型（LLM）移植引物。具体而言，我们将每个Vocabulary代币视为共享代币的稀疏线性组合，分为两个阶段：首先，计算每个新的令牌在供体嵌入空间中的表示，并用共享的锚点代币的小词典，然后将这些相同的稀疏系数转移到基础模型的范围内。在两项具有挑战性的跨态任务中 - llama $ \ to $ mistral nemo（12b）和qwen $ \ to $ llama（1b） - 我们表明，OMP可以在多个基准测试中实现最佳的零拍摄性能，而其他零局部的方法则显着降低。与基线（零内部，卑鄙的定位和诸如Wechsel，Focus，Zett）之类的现有方法相比，OPM始终取得了最佳的整体性能，有效地弥合了大型令牌差异，而没有梯度更新。我们的分析进一步将不匹配的数值令牌方案视为维护数学推理能力的关键挑战。该技术可直接使用新的引导器重复使用验证的模型权重，从而促进交叉式知识蒸馏，投机解码，结合，合并和特定于域的词汇适应。我们将方法集成到开源合并 - tokensurgeon工具中，以进行事后词汇调整。

Title: Transferring Features Across Language Models With Model Stitching

Authors: Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06609
Pdf URL: https://arxiv.org/pdf/2506.06609
Copy Paste: [[2506.06609]] Transferring Features Across Language Models With Model Stitching(https://arxiv.org/abs/2506.06609)
Keywords: language model
Abstract: In this work, we demonstrate that affine mappings between residual streams of language models is a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
摘要：在这项工作中，我们证明了剩余的语言模型流之间的仿射映射是一种有效传输模型之间特征的廉价方式。我们将此技术应用于不同大小的模型之间的稀疏自动编码器（SAE）的权重，以比较其表示形式。我们发现，小型和大型模型学会了高度相似的表示空间，这激发了较小型号上的较高培训组件，例如SAE，并以拖放节省的方式转移到较大的型号。例如，在较大型号上训练SAE时，使用小到一大到一大的SAE作为初始化可能会导致50％的训练运行。接下来，我们表明转移的探针和转向向量可以有效地恢复地面真相表现。最后，我们更深入地研究了特征级别的可传递性，发现语义和结构特征的转移明显不同，而特定类别的功能特征则忠实地绘制了其角色。总体而言，我们的发现说明了小型和大型模型的线性表示空间的相似性和差异，并展示了提高SAE训练效率的方法。

Title: Interpretable Depression Detection from Social Media Text Using LLM-Derived Embeddings

Authors: Samuel Kim, Oghenemaro Imieye, Yunting Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06616
Pdf URL: https://arxiv.org/pdf/2506.06616
Copy Paste: [[2506.06616]] Interpretable Depression Detection from Social Media Text Using LLM-Derived Embeddings(https://arxiv.org/abs/2506.06616)
Keywords: language model, llm
Abstract: Accurate and interpretable detection of depressive language in social media is useful for early interventions of mental health conditions, and has important implications for both clinical practice and broader public health efforts. In this paper, we investigate the performance of large language models (LLMs) and traditional machine learning classifiers across three classification tasks involving social media data: binary depression classification, depression severity classification, and differential diagnosis classification among depression, PTSD, and anxiety. Our study compares zero-shot LLMs with supervised classifiers trained on both conventional text embeddings and LLM-generated summary embeddings. Our experiments reveal that while zero-shot LLMs demonstrate strong generalization capabilities in binary classification, they struggle with fine-grained ordinal classifications. In contrast, classifiers trained on summary embeddings generated by LLMs demonstrate competitive, and in some cases superior, performance on the classification tasks, particularly when compared to models using traditional text embeddings. Our findings demonstrate the strengths of LLMs in mental health prediction, and suggest promising directions for better utilization of their zero-shot capabilities and context-aware summarization techniques.
摘要：社交媒体中对抑郁语言的准确检测对于早期干预心理健康状况非常有用，并且对临床实践和更广泛的公共卫生努力都具有重要意义。在本文中，我们研究了涉及社交媒体数据的三个分类任务中大型语言模型（LLM）和传统的机器学习分类器的性能：二进制抑郁症分类，抑郁症严重性分类以及抑郁症，PTSD和焦虑症之间的差异诊断分类。我们的研究将零拍的LLM与受过监督的分类器进行了比较，该分类器对传统的文本嵌入和LLM生成的摘要嵌入培训。我们的实验表明，虽然零射门LLM在二元分类中表现出强大的概括能力，但它们在细粒度的顺序分类中挣扎。相比之下，在LLMS生成的摘要嵌入培训的分类器表现出竞争性，在某些情况下，分类任务的性能优越，尤其是与使用传统文本嵌入的模型相比。我们的发现证明了LLM在心理健康预测中的优势，并提出了有希望的方向，以更好地利用其零击功能和上下文感知的摘要技术。

Title: BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs

Authors: Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06619
Pdf URL: https://arxiv.org/pdf/2506.06619
Copy Paste: [[2506.06619]] BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs(https://arxiv.org/abs/2506.06619)
Keywords: language model, llm
Abstract: A core part of legal work that has been under-explored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make new arguments to try to expand the law in a new direction and make novel and creative arguments that are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today's large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.
摘要：法律NLP法律探索的法律工作的核心部分是法律摘要的撰写和编辑。这不仅需要对司法管辖区法律的彻底理解，从判决到法规，而且还需要提出新论点来试图以新的方向扩大法律并提出对法官有说服力的新颖和创造性论点的能力。为了捕获和评估语言模型中的这些法律技能，我们介绍了一个新的数据集Hillme，该数据集专注于法律简介。它包含了语言模型的三个任务，以协助法律专业人士书面摘要：论证摘要，论证完成和案例检索。在这项工作中，我们描述了这些任务的创建，分析它们并显示当前模型的执行方式。我们看到，今天的大型语言模型（LLM）已经擅长摘要和指导的完成任务，甚至击败了人类生成的标题。然而，他们在我们的基准中的其他任务上表现不佳：现实的论证完成和检索相关的法律案件。我们希望该数据集通过特殊帮助人们进行法律工作的方式来鼓励法律NLP的更多发展。

Title: Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations

Authors: Junzhe Wang, Bichen Wang, Xing Fu, Yixin Sun, Yanyan Zhao, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06626
Pdf URL: https://arxiv.org/pdf/2506.06626
Copy Paste: [[2506.06626]] Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations(https://arxiv.org/abs/2506.06626)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have made significant progress in automated psychological counseling. However, current research focuses on single-session counseling, which doesn't represent real-world scenarios. In practice, psychological counseling is a process, not a one-time event, requiring sustained, multi-session engagement to progressively address clients' issues. To overcome this limitation, we introduce a dataset for Multi-Session Psychological Counseling Conversation Dataset (MusPsy-Dataset). Our MusPsy-Dataset is constructed using real client profiles from publicly available psychological case reports. It captures the dynamic arc of counseling, encompassing multiple progressive counseling conversations from the same client across different sessions. Leveraging our dataset, we also developed our MusPsy-Model, which aims to track client progress and adapt its counseling direction over time. Experiments show that our model performs better than baseline models across multiple sessions.
摘要：近年来，大型语言模型（LLMS）在自动心理咨询方面取得了重大进展。但是，当前的研究重点是单课咨询，这并不代表现实情况。实际上，心理咨询是一个过程，而不是一次性事件，需要持续的多课程参与来逐步解决客户的问题。为了克服这一限制，我们引入了一个用于多课程心理咨询对话数据集（Muspsy-Dataset）的数据集。我们的Muspsy-Dataset是使用公开可用的心理病例报告中的真实客户概况构建的。它捕获了咨询的动态弧线，包括来自不同会议的同一客户的多次渐进式咨询对话。利用我们的数据集，我们还开发了我们的Muspsy模型，该模型旨在跟踪客户的进度并随着时间的推移调整其咨询方向。实验表明，我们的模型在多个会话中的性能优于基线模型。

Title: SafeLawBench: Towards Safe Alignment of Large Language Models

Authors: Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06636
Pdf URL: https://arxiv.org/pdf/2506.06636
Copy Paste: [[2506.06636]] SafeLawBench: Towards Safe Alignment of Large Language Models(https://arxiv.org/abs/2506.06636)
Keywords: language model, gpt, llm, prompt
Abstract: With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs' safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs' safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8\%. We urge the community to prioritize research on the safety of LLMs.
摘要：随着大语言模型（LLM）的日益普遍性，LLM的安全引起了重大关注。但是，由于当前安全基准的主观性质，仍然缺乏评估其安全性的确定标准。为了解决这一差距，我们通过提出Safelawbench基准，从法律角度进行了首次探索LLMS的安全评估。 Safelawbench根据法律标准将安全风险分为三个级别，为评估提供了系统的全面框架。它包含24,860个多项选择问题和1,106个开放域问答（QA）任务。我们的评估包括使用零射击和少量发动机提示的2个封闭源LLM和18个开源LLMS，突出了每种型号的安全功能。我们还评估了LLMS与安全相关的推理稳定性和拒绝行为。此外，我们发现多数投票机制可以增强模型性能。值得注意的是，即使是Claude-3.5-Sonnet和GPT-4O等领先的SOTA模型，在Safelawbench上的多项选择任务中也没有超过80.5％的精度，而20个LLMS的平均准确度仍为68.8 \％。我们敦促社区优先考虑有关LLMS安全性的研究。

Title: Quantile Regression with Large Language Models for Price Prediction

Authors: Nikhita Vedula, Dushyanta Dhyani, Laleh Jalali, Boris Oreshkin, Mohsen Bayati, Shervin Malmasi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06657
Pdf URL: https://arxiv.org/pdf/2506.06657
Copy Paste: [[2506.06657]] Quantile Regression with Large Language Models for Price Prediction(https://arxiv.org/abs/2506.06657)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at this https URL to support future research.
摘要：大型语言模型（LLMS）在结构化预测任务（包括回归）中表现出了希望，但现有方法主要集中于点估计，并且缺乏不同方法的系统比较。我们使用LLMS进行非结构化输入调查概率回归，以解决挑战性的文本到分数预测任务，例如价格估计，其中细微的文本理解和不确定性量化都是至关重要的。我们提出了一种新型的分位回归方法，使LLM能够产生完整的预测分布，从而改善传统点估计值。通过对三个不同价格预测数据集进行的广泛实验，我们证明了一个带有分位数的Mistral-7b模型对点和分布估计的传统方法显着优于传统方法，这是通过三个已建立的指标来衡量的，以预测准确性和分配校准。我们对LLM方法，模型体系结构，培训方法和数据扩展的系统比较表明，Mistral-7b始终优于编码器体系结构，基于嵌入的方法和少量学习方法。我们的实验还揭示了LLM辅助标签校正在没有系统偏见的情况下实现人级准确性方面的有效性。我们的策划数据集可在此HTTPS URL上提供，以支持未来的研究。

Title: Learning Distribution-Wise Control in Representation Space for Language Models

Authors: Chunyuan Deng, Ruidi Chang, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06686
Pdf URL: https://arxiv.org/pdf/2506.06686
Copy Paste: [[2506.06686]] Learning Distribution-Wise Control in Representation Space for Language Models(https://arxiv.org/abs/2506.06686)
Keywords: language model
Abstract: Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: \href{this https URL}{this https URL}.
摘要：语言模型（LMS）的干预措施从策略性地应用于向前传球期间的模型行为。可学习的干预措施，也称为微调，旨在在概念子空间内进行点状控制，并证明有效地改变了高级行为。在这项工作中，我们将这种方法扩展到分布级别，使模型不仅可以刻薄地转换，还可以学习概念子空间的周围区域。我们证明这些方法在早期层中有效地性能，并且较大的标准偏差与改善的性能密切相关。在八个常识性推理和七个算术推理基准中，我们的分配干预措施始终在可控性和鲁棒性方面的干预措施始终优于干预措施。这些结果表明，通过分配的干预措施为指导模型行为提供了更全面的方法，并可以对语言模型进行更精细的控制。代码为：\ href {this HTTPS url} {this HTTPS url}。

Title: Dynamic and Parametric Retrieval-Augmented Generation

Authors: Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.06704
Pdf URL: https://arxiv.org/pdf/2506.06704
Copy Paste: [[2506.06704]] Dynamic and Parametric Retrieval-Augmented Generation(https://arxiv.org/abs/2506.06704)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.
摘要：检索提示的一代（RAG）已成为为大型语言模型（LLM）配备外部知识的基础范式，在信息检索和知识密集型应用程序中发挥着关键作用。但是，传统的抹布系统通常采用静态检索 - 然后再生的管道并依靠在内部的知识注入，这对于需要多ihop推理，自适应信息访问以及更深入的外部知识的复杂任务可能是最佳的。受这些局限性的激励，研究界已经超越了静态检索和内在知识注入。在新兴方向中，该教程分为两个关于抹布的快速增长和互补的研究领域：动态抹布和参数抹布。动态RAG自适应地确定在LLM的生成过程中何时以及要检索什么，从而实现了对LLM不断发展的信息需求的实时适应。参数rag重新考虑了应如何将检索到的知识注入LLM，从输入级别到参数级知识注入，以提高效率和有效性。本教程对这些新兴研究领域的最新进展进行了全面概述。它还分享了理论基础和实用见解，以支持和激发抹布的进一步研究。

Title: DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains

Authors: Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06705
Pdf URL: https://arxiv.org/pdf/2506.06705
Copy Paste: [[2506.06705]] DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains(https://arxiv.org/abs/2506.06705)
Keywords: llm
Abstract: Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. We also release a domain-specific benchmark for LLM-generated text detection in the medical and legal domains. Experiments on our benchmark show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall (0.1% false positive rate threshold). In adversarial settings, DivScore demonstrates superior robustness than other baselines, achieving on average 22.8% advantage in AUROC and 29.5% in recall. Code and data are publicly available.
摘要：在医学和法律等专业和高风险领域中检测LLM生成的文本对于打击错误的信息和确保真实性至关重要。但是，当前的零射击检测器虽然对一般文本有效，但由于域移位而应用于专用内容时，通常会失败。我们提供了理论分析，表明这种失败从根本上与人，检测器和源文本分布之间的KL差异有关。为了解决这个问题，我们提出了DivScore，这是一个使用归一化熵的评分和域知识蒸馏的零射击检测框架，以鲁棒性地识别专用域中的LLM生成的文本。我们还发布了特定领域的基准，用于在医疗和法律领域中为LLM生成的文本检测。我们的基准测试的实验表明，DivScore始终优于最先进的检测器，AUROC高14.4％，召回率高64.0％（0.1％的假阳性率阈值）。在对抗环境中，DivScore比其他基线表现出优异的鲁棒性，平均在AUROC中获得了22.8％的优势，在召回率中平均获得了29.5％的优势。代码和数据公开可用。

Title: C-PATH: Conversational Patient Assistance and Triage in Healthcare System

Authors: Qi Shi, Qiwei Han, Cláudia Soares
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06737
Pdf URL: https://arxiv.org/pdf/2506.06737
Copy Paste: [[2506.06737]] C-PATH: Conversational Patient Assistance and Triage in Healthcare System(https://arxiv.org/abs/2506.06737)
Keywords: language model, gpt, llm
Abstract: Navigating healthcare systems can be complex and overwhelming, creating barriers for patients seeking timely and appropriate medical attention. In this paper, we introduce C-PATH (Conversational Patient Assistance and Triage in Healthcare), a novel conversational AI system powered by large language models (LLMs) designed to assist patients in recognizing symptoms and recommending appropriate medical departments through natural, multi-turn dialogues. C-PATH is fine-tuned on medical knowledge, dialogue data, and clinical summaries using a multi-stage pipeline built on the LLaMA3 architecture. A core contribution of this work is a GPT-based data augmentation framework that transforms structured clinical knowledge from DDXPlus into lay-person-friendly conversations, allowing alignment with patient communication norms. We also implement a scalable conversation history management strategy to ensure long-range coherence. Evaluation with GPTScore demonstrates strong performance across dimensions such as clarity, informativeness, and recommendation accuracy. Quantitative benchmarks show that C-PATH achieves superior performance in GPT-rewritten conversational datasets, significantly outperforming domain-specific baselines. C-PATH represents a step forward in the development of user-centric, accessible, and accurate AI tools for digital health assistance and triage.
摘要：导航医疗系统可能是复杂而压倒性的，为寻求及时，适当的医疗护理的患者造成了障碍。在本文中，我们介绍了C-Path（在医疗保健中进行对话的患者援助和分类），这是一种由大型语言模型（LLMS）提供动力的新型会话AI系统，旨在帮助患者识别症状并通过自然的多扭转对话来推荐适当的医疗部门。 C-Path使用基于Llama3体系结构建立的多阶段管道对医学知识，对话数据和临床摘要进行了微调。这项工作的核心贡献是基于GPT的数据增强框架，该框架将结构化的临床知识从DDXPLUS转化为外行人友好的对话，从而使与患者的沟通规范保持一致。我们还实施了可扩展的对话历史管理策略，以确保长期连贯性。使用GPTSCORE的评估表明，诸如清晰度，信息性和建议准确性等方面的性能很强。定量基准表明，C-Path在GPT划分的对话数据集中达到了卓越的性能，从而超过了特定于域的基层。 C-Path代表了开发以用户为中心，可访问和准确的AI工具用于数字卫生援助和分类的一步。

Title: Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models

Authors: Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06751
Pdf URL: https://arxiv.org/pdf/2506.06751
Copy Paste: [[2506.06751]] Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models(https://arxiv.org/abs/2506.06751)
Keywords: language model, llm, prompt
Abstract: This paper evaluates geopolitical biases in LLMs with respect to various countries though an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models' sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.
摘要：本文评估了LLM中各个国家的地缘政治偏见，尽管分析了他们对国家，英国，英国，苏联和中国对历史事件的解释的分析。我们介绍了一个具有中性事件描述的新型数据集和来自不同国家的对比观点。我们的发现显示了巨大的地缘政治偏见，模型有利于特定的民族叙事。此外，简单的伪造提示在减少这些偏见方面的效果有限。操纵参与者标签的实验揭示了模型对归因的敏感性，有时会放大偏见或识别不一致之处，尤其是在交换标签上。这项工作突出了LLM中的民族叙事偏见，挑战了简单的证词方法的有效性，并为未来的地缘政治偏见研究提供了框架和数据集。

Title: They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse

Authors: Walter Paci (1), Alessandro Panunzi (1), Sandro Pezzelle (2) ((1) University of Florence, (2) University of Amsterdam)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06775
Pdf URL: https://arxiv.org/pdf/2506.06775
Copy Paste: [[2506.06775]] They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse(https://arxiv.org/abs/2506.06775)
Keywords: language model, llm
Abstract: Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches with the annotation of manipulative implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at this https URL
摘要：隐式内容在政治话语中起着至关重要的作用，在政治话语中，说话者系统地采用了务实的策略，例如牵连和预设来影响他们的受众。大型语言模型（LLMS）在需要复杂的语义和务实理解的任务中表现出强大的表现，突出了它们检测和解释隐式内容含义的潜力。但是，他们在政治话语中做到这一点的能力在很大程度上仍然没有得到充实。我们首次利用大型Impaqts语料库，包括意大利政治演讲，并通过操纵隐式内容的注释，我们提出了测试LLMS在这个挑战性问题中的有效性的方法。通过多项选择任务和开放式生成任务，我们证明了所有经过测试的模型都难以解释预设和含义。我们得出的结论是，当前的LLM缺乏准确解释高度隐含语言所需的关键务实能力，例如在政治话语中发现的语言。同时，我们重点介绍了提高模型性能的有希望的趋势和未来方向。我们在此HTTPS URL上发布数据和代码

Title: On the Adaptive Psychological Persuasion of Large Language Models

Authors: Tianjie Ju, Yujia Chen, Hao Fei, Mong-Li Lee, Wynne Hsu, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06800
Pdf URL: https://arxiv.org/pdf/2506.06800
Copy Paste: [[2506.06800]] On the Adaptive Psychological Persuasion of Large Language Models(https://arxiv.org/abs/2506.06800)
Keywords: language model, llm
Abstract: Previous work has showcased the intriguing capabilities of Large Language Models (LLMs) in instruction-following and rhetorical fluency. However, systematic exploration of their dual capabilities to autonomously persuade and resist persuasion, particularly in contexts involving psychological rhetoric, remains unexplored. In this paper, we first evaluate four commonly adopted LLMs by tasking them to alternately act as persuaders and listeners in adversarial dialogues. Empirical results show that persuader LLMs predominantly employ repetitive strategies, leading to low success rates. Then we introduce eleven comprehensive psychological persuasion strategies, finding that explicitly instructing LLMs to adopt specific strategies such as Fluency Effect and Repetition Effect significantly improves persuasion success rates. However, no ``one-size-fits-all'' strategy proves universally effective, with performance heavily dependent on contextual counterfactuals. Motivated by these observations, we propose an adaptive framework based on direct preference optimization that trains LLMs to autonomously select optimal strategies by leveraging persuasion results from strategy-specific responses as preference pairs. Experiments on three open-source LLMs confirm that the proposed adaptive psychological persuasion method effectively enables persuader LLMs to select optimal strategies, significantly enhancing their success rates while maintaining general capabilities. Our code is available at this https URL.
摘要：先前的工作展示了大语言模型（LLMS）在遵循教学和修辞流利方面的吸引力。但是，对他们自主说服和抵抗说服的双重能力的系统探索，尤其是在涉及心理言论的情况下，仍然没有得到探索。在本文中，我们首先要评估四个通常采用的LLM，以交替担任对抗对话中的说服者和听众。经验结果表明，说服力LLMS主要采用重复策略，导致成功率低。然后，我们引入了11种全面的心理说服策略，发现明确指示LLM采用特定策略（例如流利效应和重复效应）可显着提高说服力的成功率。但是，没有``一定程度''策略证明了普遍有效的效果，并且在很大程度上取决于上下文反事实。在这些观察结果的推动下，我们提出了一个基于直接偏好优化的自适应框架，该框架通过利用说服力来训练LLMS自主选择最佳策略，这是由特定于策略特定的响应作为偏好对的。在三个开源LLMS上进行的实验证实，提出的自适应心理说服方法有效地使LLMS可以选择最佳策略，从而显着提高其成功率，同时保持一般能力。我们的代码可在此HTTPS URL上找到。

Title: Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

Authors: James A. Michaelov, Reeka Estacio, Zhien Zhang, Benjamin K. Bergen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06808
Pdf URL: https://arxiv.org/pdf/2506.06808
Copy Paste: [[2506.06808]] Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events(https://arxiv.org/abs/2506.06808)
Keywords: language model
Abstract: Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.
摘要：语言模型能否可靠地预测可能的事件比仅仅是不可能的事件？通过嘲笑可能性，典型性和上下文相关性，我们表明，尽管有先前的工作的结果，但语言模型的执行能力远非强大。实际上，在某些条件下，所有经过测试的模型（包括Llama 3，Gemma 2和Mistral Nemo）的表现较差，比较差的一级表现，为不可能的句子（例如，刹车给停车票给了诸如“遇到的汽车”的停车票相比，诸如“汽车被验证者给探险家给汽车的停车票）的概率更高。

Title: Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

Authors: Wenyu Zhang, Yingxu He, Geyu Lin, Zhuohan Liu, Shuo Sun, Bin Wang, Xunlong Zou, Jeremy H. M. Wong, Qiongqiong Wang, Hardik B. Sailor, Nancy F. Chen, Ai Ti Aw
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.06820
Pdf URL: https://arxiv.org/pdf/2506.06820
Copy Paste: [[2506.06820]] Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs(https://arxiv.org/abs/2506.06820)
Keywords: language model, llm
Abstract: Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
摘要：音频大语言模型（Audiollms）在语音识别和翻译等语义任务中取得了强大的成果，但在建模诸如情感之类的副语言提示中仍然有限。现有方法通常将情绪理解视为分类问题，几乎没有深入了解预测背后的基本原理。在这项工作中，我们探讨了情感推理，这种策略利用了众多声音的生成能力来通过产生语义上的，循证的解释来增强情绪识别。为了在多任务录音中支持这一点，我们引入了一个统一的框架，结合了推理提示的数据监督，双重编码器体系结构和任务范围的培训。这种方法使Audiollms能够在结合情感推理的同时有效地学习不同的任务。关于Iemocap和MELD的实验表明，我们的方法不仅提高了情绪预测的准确性，而且还提高了所产生的响应的相干性和证据基础。

Title: Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Authors: Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2506.06821
Pdf URL: https://arxiv.org/pdf/2506.06821
Copy Paste: [[2506.06821]] Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems(https://arxiv.org/abs/2506.06821)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
摘要：大型语言模型（LLMS）在代码生成中表现出显着的功能，能够在推理过程中解决复杂的任务。但是，可以在测试案例生成中使用LLMS进行代码检查或调试的程度，这在很大程度上尚未探索。我们从竞争级编程（CP）计划的角度研究了这个问题，并提出了TCGBENCH，这是（LLM生成）测试案例生成器的基准。该基准分别包括两个任务，旨在研究（1）在（1）为给定CP问题生成有效的测试案例生成器中的LLM的功能，而进一步（2）生成针对性的测试案例生成器，这些测试案例生成器以人为编号的代码暴露了错误。实验结果表明，尽管最先进的LLM可以在大多数情况下生成有效的测试案例发生器，但大多数LLM都难以生成有效揭示人类法规中缺陷的目标测试案例。特别是，即使是先进的推理模型（例如O3-Mini），在产生目标发电机的任务中，人类绩效的效果大大不足。此外，我们构建了一个高质量的，手动策划的指令数据集，用于生成目标发电机。分析表明，通过提示和微调，可以在该数据集的帮助下增强LLM的性能。

Title: PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation

Authors: Arkadiusz Modzelewski, Witold Sosnowski, Tiziano Labruna, Adam Wierzbicki, Giovanni Da San Martino
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06842
Pdf URL: https://arxiv.org/pdf/2506.06842
Copy Paste: [[2506.06842]] PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation(https://arxiv.org/abs/2506.06842)
Keywords: language model, llm
Abstract: Disinformation detection is a key aspect of media literacy. Psychological studies have shown that knowledge of persuasive fallacies helps individuals detect disinformation. Inspired by these findings, we experimented with large language models (LLMs) to test whether infusing persuasion knowledge enhances disinformation detection. As a result, we introduce the Persuasion-Augmented Chain of Thought (PCoT), a novel approach that leverages persuasion to improve disinformation detection in zero-shot classification. We extensively evaluate PCoT on online news and social media posts. Moreover, we publish two novel, up-to-date disinformation datasets: EUDisinfo and MultiDis. These datasets enable the evaluation of PCoT on content entirely unseen by the LLMs used in our experiments, as the content was published after the models' knowledge cutoffs. We show that, on average, PCoT outperforms competitive methods by 15% across five LLMs and five datasets. These findings highlight the value of persuasion in strengthening zero-shot disinformation detection.
摘要：虚假信息检测是媒体素养的关键方面。心理学研究表明，有说服力的谬论的知识有助于个人发现虚假信息。受这些发现的启发，我们尝试了大型语言模型（LLMS），以测试注入说服知识是否增强了虚假信息检测。结果，我们介绍了说服力的思想链（PCOT），这是一种新型方法，利用说服力来改善零照片分类中的虚假信息检测。我们在线新闻和社交媒体帖子上广泛评估PCOT。此外，我们发布了两个小说，最新的虚假信息数据集：Eudisinfo和Multidis。这些数据集使我们实验中使用的LLM完全看不见的内容对PCOT进行评估，因为在模型的知识截止之后，内容发表了。我们表明，在五个LLM和五个数据集中，PCOT平均比竞争方法的表现高15％。这些发现突出了说服力在加强零射击虚假信息检测中的价值。

Title: Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models

Authors: Naibin Gu, Peng Fu, Xiyu Liu, Ke Ma, Zheng Lin, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06844
Pdf URL: https://arxiv.org/pdf/2506.06844
Copy Paste: [[2506.06844]] Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models(https://arxiv.org/abs/2506.06844)
Keywords: language model
Abstract: Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these numerous modules to restore performance would incur significant computational costs. Through a comprehensive analysis of the changes that occur during base model updates, we uncover an interesting phenomenon: continual training primarily affects task-specific knowledge stored in Feed-Forward Networks (FFN), while having less impact on the task-specific pattern in the Attention mechanism. Based on these findings, we introduce Trans-PEFT, a novel approach that enhances the PEFT module by focusing on the task-specific pattern while reducing its dependence on certain knowledge in the base model. Further theoretical analysis supports our approach. Extensive experiments across 7 base models and 12 datasets demonstrate that Trans-PEFT trained modules can maintain performance on updated base models without re-tuning, significantly reducing maintenance overhead in real-world applications.
摘要：参数有效的微调（PEFT）已成为微调大语言模型的常见方法，基本模型可以通过PEFT模块切换为多个用户提供服务。为了增强用户体验，基本模型需要定期更新。但是，一旦更新，PEFT模块对以前的版本进行了微调，通常会在新版本上遭受大量的性能降解。重新调整这些众多模块以恢复性能将产生巨大的计算成本。通过对基本模型更新过程中发生的变化的全面分析，我们发现了一个有趣的现象：持续培训主要影响馈送前向网络（FFN）中存储的特定任务知识，同时对注意力机制中的任务特异性模式的影响较小。基于这些发现，我们介绍了Trans-Peft，这是一种新颖的方法，可以通过关注特定于任务的模式来增强PEFT模块，同时降低其对基本模型中某些知识的依赖。进一步的理论分析支持我们的方法。跨7个基本型号和12个数据集进行的大量实验表明，经过跨PEFT训练的模块可以在不重新调整的情况下保持更新的基本模型的性能，从而大大降低了实际应用程序中的维护开销。

Title: Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning

Authors: Jiaxing Guo, Wenjie Yang, Shengzhong Zhang, Tongshan Xu, Lun Du, Da Zheng, Zengfeng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06877
Pdf URL: https://arxiv.org/pdf/2506.06877
Copy Paste: [[2506.06877]] Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning(https://arxiv.org/abs/2506.06877)
Keywords: language model, llm
Abstract: Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs' answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to baselines, especially for complex, multi-step problems. This offers a more robust path towards evaluating and training LLMs with genuine mathematical reasoning.
摘要：结果奖励的大语言模型（LLM）在数学问题解决方面取得了巨大的成功。但是，这一成功通常掩盖了一个关键问题：模型经常通过根本上不健全的推理过程获得正确的答案，这是指奖励黑客的现象。我们介绍了一种具有细粒注释的新数据集Matholympiadeval，这揭示了LLMS的答案正确性与其低过程正确性之间存在显着差距。现有的自动化方法（例如LLM-AS-A-Gudge）难以可靠地检测到这些推理缺陷。为了解决这个问题，我们提出了parastepverier，这是一种对数学解决方案进行细致，分步验证的新方法。 parastepverifier确定了不正确的推理步骤。经验结果表明，与基准相比，养老院养育者显着提高了鉴定有缺陷溶液的准确性，尤其是对于复杂的多步骤问题。这为通过真实的数学推理评估和培训LLM提供了更强大的途径。

Title: Mixture of Small and Large Models for Chinese Spelling Check

Authors: Ziheng Qiao, Houquan Zhou, Zhenghua Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06887
Pdf URL: https://arxiv.org/pdf/2506.06887
Copy Paste: [[2506.06887]] Mixture of Small and Large Models for Chinese Spelling Check(https://arxiv.org/abs/2506.06887)
Keywords: language model, llm
Abstract: In the era of large language models (LLMs), the Chinese Spelling Check (CSC) task has seen various LLM methods developed, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, achieving a balanced enhancement of precise corrections from small models and the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources, and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at this https URL.
摘要：在大型语言模型（LLM）时代，中国拼写检查（CSC）任务已经开发出各种LLM方法，但其性能仍然不令人满意。相比之下，基于微调的基于BERT的模型依赖于高质量的内域数据，表现出出色的性能，但遭受了编辑模式过度拟合。本文提出了一种新型的动态混合方法，该方法在梁搜索解码阶段有效地结合了小型和LLM的概率分布，从小模型和LLM的流利度达到了平衡的精确校正。这种方法还消除了对LLM的微调，节省了大量时间和资源，并促进了域的适应性。全面的实验表明，我们的混合方法可以显着提高误差校正能力，从而在多个数据集中实现最新的结果。我们的代码可在此HTTPS URL上找到。

Title: Automatic Speech Recognition of African American English: Lexical and Contextual Effects

Authors: Hamid Mojarad, Kevin Tang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.06888
Pdf URL: https://arxiv.org/pdf/2506.06888
Copy Paste: [[2506.06888]] Automatic Speech Recognition of African American English: Lexical and Contextual Effects(https://arxiv.org/abs/2506.06888)
Keywords: language model
Abstract: Automatic Speech Recognition (ASR) models often struggle with the phonetic, phonological, and morphosyntactic features found in African American English (AAE). This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. It examines whether the presence of CCR and ING-reduction increases ASR misrecognition. Subsequently, it investigates whether end-to-end ASR systems without an external Language Model (LM) are more influenced by lexical neighborhood effect and less by contextual predictability compared to systems with an LM. The Corpus of Regional African American Language (CORAAL) was transcribed using wav2vec 2.0 with and without an LM. CCR and ING-reduction were detected using the Montreal Forced Aligner (MFA) with pronunciation expansion. The analysis reveals a small but significant effect of CCR and ING on Word Error Rate (WER) and indicates a stronger presence of lexical neighborhood effect in ASR systems without LMs.
摘要：自动语音识别（ASR）模型通常在非裔美国人英语（AAE）中发现的语音，语音和形态句法特征。这项研究侧重于两个关键的AAE变量：辅音群集还原（CCR）和Ing-ycuction。它检查了CCR和Ing-Ractuction的存在是否增加了ASR错误识别。随后，它研究了没有外部语言模型（LM）的端到端ASR系统是否受词汇邻域效应的影响更大，而与LM的系统相比，在上下文可预测性的影响下。使用和没有LM的WAV2VEC 2.0转录非裔美国人语言（Coraal）的语料库。使用具有发音膨胀的蒙特利尔强制比对（MFA）检测到CCR和Ing-Rectuction。该分析表明，CCR和ING对单词错误率（WER）的影响很小，但表明在没有LMS的ASR系统中存在更强的词汇邻居效应。

Title: DiscoSum: Discourse-aware News Summarization

Authors: Alexander Spangher, Tenghao Huang, Jialiang Gu, Jiatong Shi, Muhao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06930
Pdf URL: https://arxiv.org/pdf/2506.06930
Copy Paste: [[2506.06930]] DiscoSum: Discourse-aware News Summarization(https://arxiv.org/abs/2506.06930)
Keywords: language model
Abstract: Recent advances in text summarization have predominantly leveraged large language models to generate concise summaries. However, language models often do not maintain long-term discourse structure, especially in news articles, where organizational flow significantly influences reader engagement. We introduce a novel approach to integrating discourse structure into summarization processes, focusing specifically on news articles across various media. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms (e.g. LinkedIn, Facebook, etc.). We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs beam search technique for structure-aware summarization, enabling the transformation of news stories to meet different stylistic and structural demands. Both human and automatic evaluation results demonstrate the efficacy of our approach in maintaining narrative fidelity and meeting structural requirements.
摘要：文本摘要的最新进展主要利用了大型语言模型来产生简洁的摘要。但是，语言模型通常不能保持长期的话语结构，尤其是在新闻文章中，组织流程会显着影响读者的参与。我们介绍了一种新颖的方法，将话语结构整合到摘要过程中，专门针对各种媒体的新闻文章。我们提供了一个新颖的摘要数据集，其中新闻文章在不同的社交媒体平台（例如LinkedIn，Facebook等）中以不同的方式进行了多次汇总。我们开发了一种新颖的新闻话语模式来描述摘要结构和一种新颖的算法Discosum，该算法采用了光束搜索技术来实现结构意识到的摘要，从而使新闻故事的转换能够满足不同的风格和结构需求。人类和自动评估结果都证明了我们方法在保持叙事忠诚和满足结构要求方面的功效。

Title: What Makes a Good Natural Language Prompt?

Authors: Do Xuan Long, Duy Dinh, Ngoc-Hai Nguyen, Kenji Kawaguchi, Nancy F. Chen, Shafiq Joty, Min-Yen Kan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06950
Pdf URL: https://arxiv.org/pdf/2506.06950
Copy Paste: [[2506.06950]] What Makes a Good Natural Language Prompt?(https://arxiv.org/abs/2506.06950)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) have progressed towards more human-like and human--AI communications have become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly quantifies natural language prompts. We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025 and blogs. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then examine how existing studies assess their impact on LLMs, revealing their imbalanced support across models and tasks, and substantial research gaps. Further, we analyze correlations among properties in high-quality natural language prompts, deriving prompting recommendations. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact. Finally, we discover that instruction-tuning on property-enhanced prompts can result in better reasoning models. Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging the gaps between human--AI communication and opening new prompting research directions.
摘要：随着大型语言模型（LLM）朝着更像人类和人类的发展发展，AI的通信变得普遍，促使促使作为决定性的组成部分。但是，关于量化自然语言提示的确切概念共识有限。我们试图通过进行荟萃分析来解决这个问题，从2022年至2025年和博客中进行了150多个与领先的NLP和AI会议的促使有关的论文。我们提出了一个属性和以人为中心的框架来评估及时质量，其中包括21个属性，分为六个维度。然后，我们研究了现有研究如何评估其对LLM的影响，揭示了他们在模型和任务之间的支持不平衡的支持以及大量的研究差距。此外，我们分析了高质量自然语言提示中属性之间的相关性，从而提出了提示。然后，我们从经验上探索了推理任务中的多毛皮迅速增强，观察到单期增强功能通常会产生最大的影响。最后，我们发现对属性增强提示的说明可以导致更好的推理模型。我们的发现为以财产为中心的及时评估和优化奠定了基础，弥合了人类交流与打开新的促使研究方向之间的差距。

Title: BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Authors: Ha-Thanh Nguyen, Chaoran Liu, Hirokazu Kiyomaru, Koichi Takeda, Yusuke Miyao, Maki Matsuda, Yusuke Oda, Pontus Stenetorp, Qianying Liu, Su Myat Noe, Hideyuki Tachibana, Kouta Nakayama, Sadao Kurohashi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06955
Pdf URL: https://arxiv.org/pdf/2506.06955
Copy Paste: [[2506.06955]] BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning(https://arxiv.org/abs/2506.06955)
Keywords: language model, gpt, llm
Abstract: We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
摘要：我们提出了BIS推理1.0，这是第一个大型日本的三段论推理问题数据集，该数据集明确设计，旨在评估大语言模型（LLMS）中的信念不合时宜的推理。与以前的数据集（如Neubaroco和JFLD）不同，侧重于一般或信念一致的推理，BIS推理1.0 1.0引入了逻辑上有效但具有信念的三段论，以发现接受过人为平衡的Corpora培训的LLMS中的推理偏见。我们基准的最新模型（包括GPT模型，Claude模型和领先的日本LLM）揭示了性能的显着差异，GPT-4O的精度达到了79.54％。我们的分析在处理逻辑上有效但对信仰冲突的输入时确定了当前LLM的关键弱点。这些发现对在法律，医疗保健和科学文献等高风险领域中部署LLM具有重要意义，在这里，真理必须超越直觉的信念以确保正直和安全。

Title: Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

Authors: Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06964
Pdf URL: https://arxiv.org/pdf/2506.06964
Copy Paste: [[2506.06964]] Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning(https://arxiv.org/abs/2506.06964)
Keywords: language model, agent
Abstract: Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in a stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
摘要：问题回答（QA）代理会自动回答自然语言提出的问题。在这项工作中，我们学会了在质量检查代理中提出澄清的问题。我们方法中的关键思想是模拟包含澄清问题并使用强化学习（RL）学习的对话。为了使RL实用，我们提出和分析离线RL目标，这些目标可以被视为奖励加权监督的微调（SFT），并在大型语言模型中易于优化。我们的工作与最近提出的基于SFT和直接偏好优化的方法形成鲜明对比，这些方法具有其他超参数，并且不会直接优化奖励。我们从经验上与这些方法进行比较，并报告了优化的奖励和语言质量的收益。

Title: Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Authors: Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06971
Pdf URL: https://arxiv.org/pdf/2506.06971
Copy Paste: [[2506.06971]] Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation(https://arxiv.org/abs/2506.06971)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we systematically investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations. Our evaluation -- spanning 700 perturbed code generations derived from LeetCode-style problems -- applies transformations such as storytelling reframing, irrelevant constraint injection, example reordering, and numeric perturbation. We observe that while certain modifications severely degrade performance (with accuracy drops up to -42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting sensitivity not only to semantics but also to surface-level prompt dynamics. These findings expose the fragility and unpredictability of current reasoning systems, underscoring the need for more principles approaches to reasoning alignments and prompting robustness. We release our perturbation datasets and evaluation framework to promote further research in trustworthy and resilient LLM reasoning.
摘要：大型语言模型（LLM）在需要复杂推理的任务中取得了巨大的成功，例如代码生成，数学问题解决和算法综合，尤其是在通过推理令牌和思想链条提示的帮助下。然而，一个核心问题仍然存在：这些模型是否真正理由，还是仅利用浅统计模式？在本文中，我们通过引入一套语义上忠实但对抗性结构的及时扰动来系统地研究推理LLM的鲁棒性。我们的评估 - 跨越了从leetcode式问题得出的700个扰动的代码世代 - 应用了诸如讲故事的转换，重新构图，无关紧要的约束注入，示例重新排序和数字扰动。我们观察到，尽管某些修改严重降低了性能（精度下降到-42.1％），但其他修改令人惊讶地提高了模型准确性高达35.3％，这不仅表明对语义，而且表明表面级及时的迅速动力学。这些发现暴露了当前推理系统的脆弱性和不可预测性，强调了对推理一致性的更多原理方法的需求并促使鲁棒性。我们发布了扰动数据集和评估框架，以促进可信赖和弹性LLM推理的进一步研究。

Title: Atomic Reasoning for Scientific Table Claim Verification

Authors: Yuji Zhang, Qingyun Wang, Cheng Qian, Jiateng Liu, Chenkai Sun, Denghui Zhang, Tarek Abdelzaher, Chengxiang Zhai, Preslav Nakov, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06972
Pdf URL: https://arxiv.org/pdf/2506.06972
Copy Paste: [[2506.06972]] Atomic Reasoning for Scientific Table Claim Verification(https://arxiv.org/abs/2506.06972)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Scientific texts often convey authority due to their technical language and complex data. However, this complexity can sometimes lead to the spread of misinformation. Non-experts are particularly susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning, resulting in errors and a lack of precision in verifying scientific claims. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load by developing modular, reusable reasoning components (i.e., atomic skills). We introduce a skill-chaining schema that dynamically composes these skills to facilitate more accurate and generalizable reasoning with a reduced cognitive load. To evaluate this, we create SciAtomicBench, a cross-domain benchmark with fine-grained reasoning annotations. With only 350 fine-tuning examples, our model trained by atomic reasoning outperforms GPT-4o's chain-of-thought method, achieving state-of-the-art results with far less training data.
摘要：科学文本经常由于其技术语言和复杂数据而传达权威。但是，这种复杂性有时会导致错误信息的传播。由于其高度密度和可感知的信誉，非专家特别容易受到基于科学表的误导性主张的影响。现有的表索赔验证模型，包括最先进的大语言模型（LLM），通常会在精确的细粒度推理中挣扎，导致错误和在验证科学主张方面缺乏精确性。受认知负载理论的启发，我们建议增强模型解释基于表的主张的能力涉及通过开发模块化的，可重复使用的推理组件（即原子技能）来减少认知载荷。我们介绍了一种动态构成这些技能的技能链接模式，以减少认知负载，以促进更准确和可推广的推理。为了评估这一点，我们创建了SciatoMicicBench，这是一种带有细粒推理注释的跨域基准测试。只有350个微调示例，我们的模型由原子推理培训的模型优于GPT-4O的经过思考链方法，从而获得了最先进的结果，而培训数据却少得多。

Title: Chain of Methodologies: Scaling Test Time Computation without Training

Authors: Cong Liu, Jie Wu, Weigang Wu, Xu Chen, Liang Lin, Wei-Shi Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06982
Pdf URL: https://arxiv.org/pdf/2506.06982
Copy Paste: [[2506.06982]] Chain of Methodologies: Scaling Test Time Computation without Training(https://arxiv.org/abs/2506.06982)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often struggle with complex reasoning tasks due to insufficient in-depth insights in their training data, which are typically absent in publicly available documents. This paper introduces the Chain of Methodologies (CoM), an innovative and intuitive prompting framework that enhances structured thinking by integrating human methodological insights, enabling LLMs to tackle complex tasks with extended reasoning. CoM leverages the metacognitive abilities of advanced LLMs, activating systematic reasoning throught user-defined methodologies without explicit fine-tuning. Experiments show that CoM surpasses competitive baselines, demonstrating the potential of training-free prompting methods as robust solutions for complex reasoning tasks and bridging the gap toward human-level reasoning through human-like methodological insights.
摘要：大型语言模型（LLMS）通常由于其培训数据中的深入见解而经常在复杂的推理任务中挣扎，而这些培训数据中通常在公开可用的文档中不存在。本文介绍了方法链（COM），这是一个创新和直观的提示框架，通过整合人类的方法论见解来增强结构性思维，从而使LLMS能够将复杂的任务与扩展的推理解决。 COM利用高级LLM的元认知能力，通过用户定义的方法激活系统推理，而无需显式微调。实验表明，COM超过了竞争基线，证明了无训练提示方法作为复杂推理任务的强大解决方案的潜力，并通过类似人类的方法论见解将差距弥合了人类水平的推理。

Title: Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors

Authors: Senqi Yang, Dongyu Zhang, Jing Ren, Ziqi Xu, Xiuzhen Zhang, Yiliao Song, Hongfei Lin, Feng Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06987
Pdf URL: https://arxiv.org/pdf/2506.06987
Copy Paste: [[2506.06987]] Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors(https://arxiv.org/abs/2506.06987)
Keywords: language model
Abstract: Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models. Our dataset and code are available at this https URL.
摘要：隐喻在沟通中普遍存在，这对于自然语言处理至关重要（NLP）。先前关于自动隐喻处理的研究主要依赖于由英国样本组成的培训数据，这些数据通常反映了西欧或北美的偏见。这种文化偏斜会导致高估模型性能和对NLP进步的贡献。但是，文化偏见对隐喻处理的影响，尤其是在多模式环境中，在很大程度上尚未探索。为了解决这一差距，我们介绍了MultiMM，这是一种多元文化的多模式隐喻数据集，旨在涉及中文和英语的跨文化研究。 MultiMM由8,461个文本图像广告对组成，每个广告对伴随着细粒度的注释，从一个文化领域以外的多模式隐喻提供了更深入的了解。此外，我们提出了富含情感的隐喻检测（SEMD），这是一种基线模型，该模型整合了情感嵌入，以增强跨文化背景的隐喻理解。实验结果验证了SEMD对隐喻检测和情感分析任务的有效性。我们希望这项工作能够提高NLP研究中对文化偏见的认识，并有助于开发更公平，更具包容性的语言模型。我们的数据集和代码可在此HTTPS URL上找到。

Title: Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Authors: Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07001
Pdf URL: https://arxiv.org/pdf/2506.07001
Copy Paste: [[2506.07001]] Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text(https://arxiv.org/abs/2506.07001)
Keywords: language model, gpt, llm
Abstract: The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack--which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT--adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors--including neural network-based, watermark-based, and zero-shot approaches--our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
摘要：大型语言模型（LLM）的能力越来越多，引起了人们对AI生成的窃和社会工程学的滥用的担忧。尽管已经提出了各种AI生成的文本检测器来减轻这些风险，但许多探测器仍然容易受到简单的逃避技术的影响，例如释义。但是，最近的探测器对这种基本攻击表现出更大的鲁棒性。在这项工作中，我们引入了对抗性释义，这是一个无训练的攻击框架，它普遍地使任何AI生成的文本人性化，以更有效地逃避检测。我们的方法在AI文本探测器的指导下利用了核心遵循的LLM来解释AI生成的内容，从而产生了针对的对抗性示例，这些示例被专门优化以绕过检测。广泛的实验表明，我们的攻击在几个检测系统中既有效又可转移。例如，与简单的释义攻击相比，具有讽刺意味的是，雷达的真实阳性在1％的假阳性（t@1％F）上增加了8.57％，在快速访问的范围内将15.03％的人提高到15.03％的 - 对抗性释义，在Openai-Roberta-large的指导下，将t@1％F@1％F降低了64.49％的radar和速度为98％。在Openai-Roberta-large的指导下，在包括基于神经网络的基于神经网络的探测器（包括基于神经网络，基于水印和零照片的方法）中，我们的攻击达到87.88％。我们还分析了文本质量和攻击成功之间的权衡，发现我们的方法可以大大降低检测率，而文本质量大多会略有降级。我们的对抗性设置强调了鉴于日益复杂的逃避技术的需求，需要进行更健壮和弹性的检测策略。

Title: KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering

Authors: Zhongze Luo, Weixuan Wan, Qizhi Zheng, Yanhong Bai, Jingyun Sun, Jian Wang, Dan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07037
Pdf URL: https://arxiv.org/pdf/2506.07037
Copy Paste: [[2506.07037]] KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering(https://arxiv.org/abs/2506.07037)
Keywords: language model, retrieval-augmented generation
Abstract: There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to meet the rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question-answering system for communication standards. The experimental results show that after LoRA tuning on the constructed dataset of 6,587 questions and answers in the field of communication standards, Qwen2.5-7B-Instruct demonstrates outstanding professional capabilities in the field of communication standards on the test set. BLEU-4 rose from 18.8564 to 66.8993, and evaluation indicators such as ROUGE also increased significantly, outperforming the fine-tuning effect of the comparison model Llama-3-8B-Instruct. Based on the ontology framework containing 6 entity attributes and 10 relation attributes, a knowledge graph of the communication standard domain containing 13,906 entities and 13,524 relations was constructed, showing a relatively good query accuracy rate. The intelligent consultation and question-answering system enables the fine-tuned model on the server side to access the locally constructed knowledge graph and conduct graphical retrieval of key information first, which is conducive to improving the question-answering effect. The evaluation using DeepSeek as the Judge on the test set shows that our RAG framework enables the fine-tuned model to improve the scores at all five angles, with an average score increase of 2.26%. And combined with web services and API interfaces, it has achieved very good results in terms of interaction experience and back-end access, and has very good practical application value.
摘要：通信领域中有许多类型的标准。传统咨询模型的周期很长，并且依赖于专家的知识和经验，因此很难满足快速发展的技术需求。本文将大语模型的微调与知识图的构建结合在一起，以实施智能咨询和交流标准的提问系统。实验结果表明，在洛拉（Lora）调整了在沟通标准领域的6,587个问题和答案的构建数据集上后，QWEN2.5-7B教学表明，在测试集中的通信标准领域表现出了出色的专业能力。 BLEU-4从18.8564上升到66.8993，评估指标（例如Rouge）也显着增加，表现优于比较模型Llama-3-8B教学的微调效应。基于包含6个实体属性和10个关系属性的本体框架，构建了包含13,906个实体和13,524个关系的通信标准域的知识图，显示了相对良好的查询准确率。智能咨询和提问系统使服务器端的微调模型能够访问本地构造的知识图，并首先对关键信息的图形检索进行图形检索，这有利于提高问题的避开效应。使用DeepSeek作为测试集的法官的评估表明，我们的抹布框架使微调模型可以在所有五个角度上提高分数，平均得分增加了2.26％。结合Web服务和API接口，它在互动经验和后端访问方面取得了很好的成果，并且具有很好的实用应用价值。

Title: Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants

Authors: Stergios Chatzikyriakidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07042
Pdf URL: https://arxiv.org/pdf/2506.07042
Copy Paste: [[2506.07042]] Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants(https://arxiv.org/abs/2506.07042)
Keywords: gpt, llm, retrieval-augmented generation
Abstract: Extracting structured computational representations of historical events from narrative text remains computationally expensive when constructed manually. While RDF/OWL reasoners enable graph-based reasoning, they are limited to fragments of first-order logic, preventing deeper temporal and semantic analysis. This paper addresses both challenges by developing automatic historical event extraction models using multiple LLMs (GPT-4, Claude, Llama 3.2) with three enhancement strategies: pure base generation, knowledge graph enhancement, and Retrieval-Augmented Generation (RAG). We conducted comprehensive evaluations using historical texts from Thucydides. Our findings reveal that enhancement strategies optimize different performance dimensions rather than providing universal improvements. For coverage and historical breadth, base generation achieves optimal performance with Claude and GPT-4 extracting comprehensive events. However, for precision, RAG enhancement improves coordinate accuracy and metadata completeness. Model architecture fundamentally determines enhancement sensitivity: larger models demonstrate robust baseline performance with incremental RAG improvements, while Llama 3.2 shows extreme variance from competitive performance to complete failure. We then developed an automated translation pipeline converting extracted RDF representations into Coq proof assistant specifications, enabling higher-order reasoning beyond RDF capabilities including multi-step causal verification, temporal arithmetic with BC dates, and formal proofs about historical causation. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.
摘要：从叙事文本中提取历史事件的结构化计算表示，手动构造时，计算上的计算量仍然保持昂贵。尽管RDF/OWL推理器启用了基于图的推理，但它们仅限于一阶逻辑的片段，以防止更深入的时间和语义分析。本文通过使用多个LLM（GPT-4，Claude，Llama 3.2）开发自动历史事件提取模型来解决这两个挑战：具有三种增强策略：纯基本生成，知识图增强和检索增强的生成（RAG）。我们使用修昔底德的历史文本进行了全面的评估。我们的发现表明，增强策略优化了不同的性能维度，而不是提供普遍的改进。对于覆盖范围和历史广度，基本的生成通过Claude和GPT-4提取全面的活动来实现最佳性能。但是，为了精确，抹布增强可提高坐标精度和元数据的完整性。模型体系结构从根本上确定增强灵敏度：较大的模型表明了稳健的基线性能，并改进了碎布，而Llama 3.2显示了从竞争性能到完全失败的极大差异。然后，我们开发了一个自动翻译管道，将提取的RDF表示转换为COQ证明助手规格，从而超出了RDF功能的高阶推理，包括多步性因果验证，带有BC日期的时间算法以及有关历史原因的正式证明。 COQ形式化验证了抹布发现的事件类型代表合法的领域特定的语义结构，而不是本体论违规。

Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07044
Pdf URL: https://arxiv.org/pdf/2506.07044
Copy Paste: [[2506.07044]] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning(https://arxiv.org/abs/2506.07044)
Keywords: language model, llm, hallucination
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
摘要：多模式的大语言模型（MLLM）在理解常见的视觉元素方面表现出了令人印象深刻的能力，这主要是由于它们的大规模数据集和高级培训策略。但是，由于医疗方案与一般领域中的数据和任务之间的固有差异，它们在医疗应用中的有效性仍然有限。具体而言，现有的医疗MLLM面临以下临界局限性：（1）由于次优数据策划过程而导致的幻觉效果增强了对医学知识的覆盖有限，（2）缺乏针对复杂医疗场景量身定制的推理能力。为了应对这些挑战，我们首先提出了一项全面的数据策划程序，（1）不仅从医学成像中，而且还从广泛的医学文本和通用域数据中有效地获取丰富的医学知识数据；（2）合成准确的医疗标题，视觉问题答案（VQA）和推理样本。结果，我们构建了一个充满广泛医学知识的多模式数据集。在策划数据的基础上，我们介绍了我们的医学专业MLLM：Lingshu。 Lingshu接受了多阶段培训，以逐步嵌入医学专业知识并逐步增强其任务解决能力。此外，我们初步探讨了使用可验证的奖励范式应用强化学习的潜力，以增强Lingshu的医疗推理能力。此外，我们开发了Medevalkit，这是一个统一的评估框架，可为标准化，公平和有效的模型评估巩固领先的多模式和文本医学基准。我们评估了Lingshu在三个基本医疗任务，多模式质量检查，基于文本的QA和医学报告的生成方面的表现。结果表明，在大多数任务上，Lingshu始终优于现有的开源多模型。

Title: Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models

Authors: Kai Xiong, Xiao Ding, Yixin Cao, Yuxiong Yan, Li Du, Yufei Zhang, Jinglong Gao, Jiaqian Liu, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07064
Pdf URL: https://arxiv.org/pdf/2506.07064
Copy Paste: [[2506.07064]] Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models(https://arxiv.org/abs/2506.07064)
Keywords: language model, llm
Abstract: Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark Com$^2$ focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory~(e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at this https URL.
摘要：大型语言模型（LLMS）通过预训练掌握了丰富的简单和明确的常识知识，从而使它们能够在简单的常识性推理中实现类似人类的表现。然而，LLM努力用简单的常识性知识（例如了解某些事件的长期影响）来推理，人类倾向于更多地关注更多。现有的作品着重于数学和代码等复杂任务，而复杂的常识性推理由于其不确定性和缺乏结构而保持不足。为了填补这一空白并与现实世界中的问题保持一致，我们提出了一个基准com $^2 $，专注于复杂的常识性推理。我们首先结合了因果事件图，以用作结构化的复杂常识。然后，我们采用因果理论〜（例如，干预）来修改因果事件图，并获得符合人类关注的不同情况。最后，LLM被用来以缓慢思考的方式合成示例，这是由修改后的因果图中的逻辑关系指导的。此外，我们使用侦探故事来构建一个更具挑战性的子集。实验表明，LLM在推理深度和广度方面挣扎，而训练后和缓慢的思维可以减轻这一点。该代码和数据可在此HTTPS URL上找到。

Title: Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing

Authors: Yuanhe Tian, Pengsen Cheng, Guoqing Jin, Lei Zhang, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07086
Pdf URL: https://arxiv.org/pdf/2506.07086
Copy Paste: [[2506.07086]] Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing(https://arxiv.org/abs/2506.07086)
Keywords: llm, prompt
Abstract: Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or straightforward fusion of cross-modal information that fail to capture complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly deconstructs visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach firstly encodes and aligns input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative tasks for affective computing, namely, multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.
摘要：多模式情感计算旨在自动识别和解释来自图像和文本等不同数据源的人类态度，从而增强人类计算机的互动和情感理解。现有方法通常依赖于单峰分析或跨模式信息的直接融合，这些信息未能捕获不同模式中提出的复杂和相互冲突的证据。在本文中，我们提出了一种基于LLM的新方法，用于情感计算，将视觉和文本表示形式明确解构为共享（模态不变）和特定于模态的组件。具体而言，我们的方法首先使用预训练的多模式编码器编码和对齐输入方式，然后采用代表分解框架将共同的情绪内容与独特的提示分开，并最终通过注意力机制整合了这些分解的信号，以形成动态软促使多模式的多模式LLM。对三个代表性计算的代表性任务进行了广泛的实验，即基于多模式的情感分析，多模式情绪分析和可恨的模因检测，这证明了我们方法的有效性，这始终超过了强大的基准和最新模型。

Title: How Far Are We from Optimal Reasoning Efficiency?

Authors: Jiaxuan Gao, Shu Yan, Qixin Tan, Lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, Yi Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07104
Pdf URL: https://arxiv.org/pdf/2506.07104
Copy Paste: [[2506.07104]] How Far Are We from Optimal Reasoning Efficiency?(https://arxiv.org/abs/2506.07104)
Keywords: chain-of-thought
Abstract: Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning base LRMs across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying deviations of any fine-tuned LRMs from these frontiers. Systematic evaluation on challenging mathematical benchmarks reveals significant gaps in current methods: they either sacrifice accuracy for short length or still remain inefficient under tight token budgets. To reduce the efficiency gap, we propose REO-RL, a class of Reinforcement Learning algorithms that minimizes REG by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error using a small set of token budgets. Through systematic benchmarking, we demonstrate that our efficiency metric, REG, effectively captures the accuracy-length trade-off, with low-REG methods reducing length while maintaining accuracy. Our approach, REO-RL, consistently reduces REG by >=50 across all evaluated LRMs and matching Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy loss. Ablation studies confirm the effectiveness of our exponential token budget strategy. Finally, our findings highlight that fine-tuning LRMs to perfectly align with the efficiency frontiers remains an open challenge.
摘要：大型推理模型（LRMS）通过扩展的思想链（COT）推理表现出显着的解决问题的能力，但通常会产生过多的冗长和冗余的推理轨迹。这种效率低下会导致高推理成本和限制实际部署。尽管现有的微调方法旨在提高推理效率，但由于评估不一致，评估其效率提高仍然具有挑战性。在这项工作中，我们介绍了推理效率边界，从各种方法和培训配置中源自微调基础LRM的经验上限。基于这些边界，我们提出了推理效率差距（REG），这是对这些边界的任何微调LRM的统一度量量化偏差。对具有挑战性的数学基准的系统评估揭示了当前方法的巨大差距：它们要么牺牲了短长的精度，要么在标记预算紧张的情况下仍然保持降低。为了减少效率差距，我们提出了REO-RL，这是一类强化学习算法，通过以稀疏的代币预算为目标来最大程度地减少REG。利用数值集成在战略性选择的预算上，REO-RL使用一小部分代币预算近似于较低误差的全部效率目标。通过系统的基准测试，我们证明了我们的效率指标有效地捕获了准确性的权衡，低分方法可以降低长度，同时保持准确性。在所有评估的LRM中，我们的方法REO-RL始终将REG降低> = 50，并在16K代币预算下匹配QWEN3-4B/8B效率边界，精度损失最小。消融研究证实了我们指数标记预算策略的有效性。最后，我们的发现强调，与效率边界完美保持一致的微调LRM仍然是一个开放的挑战。

Title: Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Authors: Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07106
Pdf URL: https://arxiv.org/pdf/2506.07106
Copy Paste: [[2506.07106]] Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models(https://arxiv.org/abs/2506.07106)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at this https URL.
摘要：大型语言模型（LLMS）在自然语言推理任务中表现出很强的表现，但是它们的推理过程仍然脆弱且难以解释。促使诸如Thebough（COT）（COT）之类的技术通过引发中间推理步骤或汇总多个输出来增强可靠性。但是，它们缺乏实施逻辑结构和评估内部连贯性的机制。我们介绍了思想定理（TOTH），这是一个新颖的框架，将推理作为三种平行代理之间的协作建模，每种都模拟了一种独特的推理方式：绑架，演绎和感应性。每个代理都会产生一个推理轨迹，该轨迹构成形式的推理图。为了评估一致性，我们应用了以自然语言推理（NLI）为指导的贝叶斯信仰传播，为每个步骤分配了置信度得分。选择最连贯的图以得出最终答案。对符号（Weboflies）和数值（Multiarith）推理基准的实验表明，TOTH始终在多个LLMS上胜过COT，自矛盾和COT编码，同时产生可解释的且逻辑上扎根的推理链。我们的发现提出了一个有希望的方向，可以建立更强大和认知灵感的LLM推理。该实现可在此HTTPS URL上获得。

Title: Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting

Authors: Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07142
Pdf URL: https://arxiv.org/pdf/2506.07142
Copy Paste: [[2506.07142]] Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting(https://arxiv.org/abs/2506.07142)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to "think step by step" (Wei et al., 2022). CoT is a widely adopted method for improving reasoning tasks, however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things: - The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not inherently engage in step-by-step processing by default. However, CoT can introduce more variability in answers, sometimes triggering occasional errors in questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even if not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers. - For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response.
摘要：这是一系列简短报告中的第二份，该报告旨在帮助商业，教育和政策领导者通过严格的测试来了解与AI合作的技术细节。在本报告中，我们调查了思想链（COT）提示，该技术鼓励大型语言模型（LLM）“逐步思考”（Wei等，2022）。 COT是一种改善推理任务的广泛采用方法，但是，我们的发现表明了其有效性更细微的图片。我们演示了两件事： - 经过思考链的提示的有效性可能会取决于任务和模型的类型。对于非争议模型，COT通常会将平均性能提高少量，尤其是如果该模型本质上不逐步进行默认情况下进行逐步处理。但是，COT可以在答案中引入更多的可变性，有时会触发该模型会变得正确的问题中偶尔出现错误。我们还发现，即使没有询问，许多最近的模型即使没有询问。对于这些模型，执行COT的请求几乎没有影响。与直接答案相比，执行COT通常需要更多的令牌（成本和时间增加）。 - 对于具有明确推理功能设计的模型，COT通常只会导致边际（如果有）获得答案的准确性。但是，它显着增加了时间，并且代币需要产生响应。

Title: Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis

Authors: Yaping Chai, Haoran Xie, Joe S. Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07148
Pdf URL: https://arxiv.org/pdf/2506.07148
Copy Paste: [[2506.07148]] Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis(https://arxiv.org/abs/2506.07148)
Keywords: language model, llm, prompt
Abstract: Large language model (LLM) is an effective approach to addressing data scarcity in low-resource scenarios. Recent existing research designs hand-crafted prompts to guide LLM for data augmentation. We introduce a data augmentation strategy for the aspect category sentiment analysis (ACSA) task that preserves the original sentence semantics and has linguistic diversity, specifically by providing a structured prompt template for an LLM to generate predefined content. In addition, we employ a post-processing technique to further ensure semantic consistency between the generated sentence and the original sentence. The augmented data increases the semantic coverage of the training distribution, enabling the model better to understand the relationship between aspect categories and sentiment polarities, enhancing its inference capabilities. Furthermore, we propose a confidence-weighted fine-tuning strategy to encourage the model to generate more confident and accurate sentiment polarity predictions. Compared with powerful and recent works, our method consistently achieves the best performance on four benchmark datasets over all baselines.
摘要：大型语言模型（LLM）是解决低资源场景中数据稀缺性的有效方法。现有的研究设计手工制作的提示指导LLM进行数据增强。我们介绍了方面类别情感分析（ACSA）任务的数据增强策略，该策略保留了原始句子语义并具有语言多样性，特别是通过为LLM提供结构化的及时模板来生成预定义的内容。此外，我们采用了一种后处理技术来进一步确保生成的句子和原始句子之间的语义一致性。增强数据增加了训练分布的语义覆盖范围，从而使模型更好地了解方面类别和情感极性之间的关系，从而增强其推理能力。此外，我们提出了一种信心加权的微调策略，以鼓励模型产生更自信和准确的情感极性预测。与强大和最近的作品相比，我们的方法始终在所有基线的四个基准数据集上取得了最佳性能。

Title: Syntactic Control of Language Models by Posterior Inference

Authors: Vicky Xefteri, Tim Vieira, Ryan Cotterell, Afra Amini
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07154
Pdf URL: https://arxiv.org/pdf/2506.07154
Copy Paste: [[2506.07154]] Syntactic Control of Language Models by Posterior Inference(https://arxiv.org/abs/2506.07154)
Keywords: language model, gpt
Abstract: Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability, yet it remains a challenging task. In this paper, we argue that sampling algorithms based on the posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure. Our experiments with GPT2 and Llama3-8B models show that with an appropriate proposal distribution, we can improve syntactic accuracy, increasing the F1 score from $12.31$ (GPT2-large) and $35.33$ (Llama3-8B) to about $93$ in both cases without compromising the language model's fluency. These results underscore both the complexity of syntactic control and the effectiveness of sampling algorithms, offering a promising approach for applications where precise control over syntax is essential.
摘要：控制语言模型产生的文本的句法结构对于需要清晰，风格一致性或解释性的应用程序很有价值，但它仍然是一项艰巨的任务。在本文中，我们认为基于后推理的采样算法可以有效地在生成过程中实现目标选区结构。我们的方法结合了顺序的蒙特卡洛，该蒙特卡洛通过从提案分布中取样来估计后验分布，以及句法标记仪，可确保每个产生的令牌与所需的句法结构对齐。我们对GPT2和LLAMA3-8B模型进行的实验表明，通过适当的提案分布，我们可以提高句法准确性，从而将F1分数从$ 12.31 $（GPT2-LARGE）和35.33美元（Llama3-8B）提高到两种情况下，而无需涉及语言模型的流动性。这些结果强调了句法控制的复杂性和采样算法的有效性，为精确控制语法至关重要的应用提供了有希望的方法。

Title: GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization

Authors: Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, Jiaqi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07160
Pdf URL: https://arxiv.org/pdf/2506.07160
Copy Paste: [[2506.07160]] GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization(https://arxiv.org/abs/2506.07160)
Keywords: language model, gpt, llm
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning, amid which geometry problem solving remains a challenging area where auxiliary construction plays a enssential role. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring massive computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and a (2) length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g. GRPO), achieving an average improvement of 4.29% across all benchmarks.
摘要：大型语言模型（LLMS）的最新进展表明，在几何问题解决哪种几何问题中，尤其是在数学推理中，尤其是在数学推理中，辅助构造在哪些几何问题上仍然具有挑战性的领域。现有的方法要么实现次优性能或依赖大规模的LLM（例如GPT-4O），因此产生了巨大的计算成本。我们认为，通过可验证的奖励（例如，GRPO）的增强学习为训练较小的模型提供了有希望的方向，这些模型可以有效地将辅助结构与强大的几何推理结合在一起。但是，直接将GRPO应用于几何推理，由于其对无条件奖励的依赖而产生了基本局限性，这导致了不加选择的和适得其反的辅助结构。为了应对这些挑战，我们提出了团体对比政策优化（GCPO），这是一个新颖的强化学习框架，具有两个关键的创新：（1）组对比度掩饰，该掩饰可适应基于上下文实用程序的辅助构建和（2）长度奖励的辅助构建的正面或负面奖励信号，以促进较长的推理链。在GCPO的基础上，我们开发了一个经济适用大小的几何推理模型的几何泽罗，这些模型明智地确定何时采用辅助结构。我们跨流行的几何基准（几何3K，Mathvista）进行的广泛的经验评估表明，几何学模型始终超过了基层（例如GRPO），在所有基准测试中，平均提高了4.29％的平均改善。

Title: CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI -- XXI Simpósio Brasileiro de Sistemas de Informação

Authors: Washington Cunha, Leonardo Rocha, Marcos André Gonçalves
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07169
Pdf URL: https://arxiv.org/pdf/2506.07169
Copy Paste: [[2506.07169]] CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI -- XXI Simpósio Brasileiro de Sistemas de Informação(https://arxiv.org/abs/2506.07169)
Keywords: language model
Abstract: Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power and more complexity, best exemplified by the Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. This \textbf{Ph.D. dissertation} focuses on an under-investi\-gated NLP data engineering technique, whose potential is enormous in the current scenario known as Instance Selection (IS). The IS goal is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the training process cost. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task -- Automatic Text Classification (ATC), considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. We also propose two novel IS solutions that are noise-oriented and redundancy-aware, specifically designed for large datasets and transformer architectures. Our final solution achieved an average reduction of 41\% in training sets, while maintaining the same levels of effectiveness in all datasets. Importantly, our solutions demonstrated speedup improvements of 1.67x (up to 2.46x), making them scalable for datasets with hundreds of thousands of documents.
摘要：自然语言处理（NLP）的进展已由更多规则决定：更多数据，更多的计算能力和更复杂性，最好用大语言模型来体现。但是，针对特定应用的培训（或微调）大型密集模型通常需要大量的计算资源。此\ textbf {ph.d。论文}着重于不足的NLP数据工程技术，其潜力在当前称为实例选择（IS）的情况下是巨大的。 IS的目标是通过消除嘈杂或冗余实例来减少训练集的规模，同时保持训练有素的模型的有效性并降低培训过程成本。我们考虑了使用多种分类解决方案和许多数据集，我们提供了应用于必不可少的NLP任务（自动文本分类（ATC））的IS方法的全面且科学上的比较。我们的发现揭示了尚未开发的IS解决方案的潜力。我们还提出了两种小说是面向噪声和冗余感知的解决方案，专为大型数据集和变压器体系结构而设计。我们的最终解决方案在训练集中平均减少了41 \％，同时保持所有数据集的有效性水平相同。重要的是，我们的解决方案显示出1.67倍的加速改进（高达2.46倍），这使得它们对于具有数十万个文档的数据集可扩展。

Title: RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality

Authors: Chenlong Zhang, Zhuoran Jin, Hongbang Yuan, Jiaheng Wei, Tong Zhou, Kang Liu, Jun Zhao, Yubo Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07171
Pdf URL: https://arxiv.org/pdf/2506.07171
Copy Paste: [[2506.07171]] RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality(https://arxiv.org/abs/2506.07171)
Keywords: language model, llm
Abstract: The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget--related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only $12%$ forget set and $8%$ synthesized boundary data, RULE outperforms existing baselines by up to $17.5%$ forget quality and $16.3%$ naturalness response while maintaining general utility, achieving forget--retain Pareto optimality. Remarkably, we further observe that RULE improves the naturalness of model outputs, enhances training efficiency, and exhibits strong generalization ability, generalizing refusal behavior to semantically related but unseen queries.
摘要：大规模，未经过的Corpora培训的大型语言模型（LLM）的广泛部署引起了人们对纳入敏感，版权或非法内容的越来越关注的担忧。这导致对LLM学习的兴趣日益增加：在不从头开始或退化整体实用程序的情况下，选择性地删除模型中的特定信息的任务。但是，现有方法通常依赖于大规模忘记和保留数据集，并且遭受不自然的反应，概括或灾难性效用损失。在这项工作中，我们提出了加固的学习（规则），这是一个有效的框架，将未学习作为拒绝边界优化问题。使用可验证的奖励功能，对忘记集合和综合边界查询的一小部分进行训练，该奖励功能鼓励安全拒绝与忘记相关的查询，同时保留对允许输入的有用响应。我们提供理论和经验证据，证明了规则在不损害模型效用的情况下实现有针对性的学习的有效性。实验结果表明，只有$ 12％$忘记的设置和$ 8％$ $合成的边界数据，规则的表现优于现有基准，高达$ 17.5％$忘记质量和$ 16.3％的$ 16.3％$自然响应，同时保持一般性的公用事业，可实现忘记的遗忘 - 退休 - 退休 - 退休 - 退休 - 退休 - 退休 - 退休。值得注意的是，我们进一步观察到，规则可以提高模型输出的自然性，提高训练效率，并具有强大的概括能力，将拒绝行为推广到语义相关但看不见的查询。

Title: Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Authors: Wenrui Zhou, Shu Yang, Qingsong Yang, Zikun Guo, Lijie Hu, Di Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07180
Pdf URL: https://arxiv.org/pdf/2506.07180
Copy Paste: [[2506.07180]] Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs(https://arxiv.org/abs/2506.07180)
Keywords: language model, llm, prompt
Abstract: As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first dedicated benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the visual domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. In addition, we explore key-frame selection as an interpretable, training-free mitigation strategy, which reveals potential paths for reducing sycophantic bias by strengthening visual grounding.
摘要：随着视频大型语言模型（视频-LLM）越来越多地集成到需要扎根的多模式推理的现实应用程序中，确保其事实的一致性和可靠性至关重要。但是，sycophancy是这些模型与用户输入相一致的趋势，即使它与视觉证据相矛盾，也破坏了它们在这种情况下的可信赖性。当前的无浮力研究在很大程度上忽略了其在视频域中的具体表现，从而显着缺乏系统的基准和有针对性的评估，以了解视频llms在误导性用户输入下的响应方式。为了填补这一空白，我们提出了VISE（视频-LLM Sycophancy基准测试和评估），这是第一个专门的基准测试，旨在评估各种问题格式，及时的偏见和视觉推理任务的最先进的视频群体中的Sycophantic行为。具体而言，Vise Pioneeringly将粘粘体的语言视角带入了视觉域，从而在多种粘粘类型和相互作用模式上进行了细粒度的分析。此外，我们探索了钥匙框的选择作为一种可解释的，无训练的缓解策略，该策略揭示了通过增强视觉接地来减少象征性偏见的潜在途径。

Title: SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes

Authors: Wenxuan Xie, Yaxun Dai, Wenhao Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07245
Pdf URL: https://arxiv.org/pdf/2506.07245
Copy Paste: [[2506.07245]] SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes(https://arxiv.org/abs/2506.07245)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model's ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.
摘要：大型语言模型（LLMS）的最新进展已大大提高了文本到SQL任务的性能。但是，先前的方法通常依赖于推理时间提供的静态，预处理的数据库信息，这限制了模型充分理解数据库内容的能力。如果没有动态交互，LLM将被限制为固定的，人为提供的上下文，并且无法自主探索基础数据。为了解决此限制，我们提出了SDE-SQL，该框架使大型语言模型在推断过程中对数据库进行自我驱动探索。这是通过生成和执行SQL探针来完成的，该探针允许该模型从数据库中积极检索信息并迭代地更新其对数据的理解。与先前的方法不同，SDE-SQL在零弹位设置中运行，而无需依赖任何问题-SQL Pairs作为context演示。当通过QWEN2.5-72B教学进行评估时，SDE-SQL与Vanilla QWEN2.5-72B教学基线的执行精度相对提高了8.02％，建立了基于开放式模型的方法之间的新方法，而无需监督微调（sft）。此外，使用SFT，可以进一步增强SDE-SQL的性能，从而额外提高0.52％。

Title: Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages

Authors: Lance Calvin Lim Gamboa, Yue Feng, Mark Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07249
Pdf URL: https://arxiv.org/pdf/2506.07249
Copy Paste: [[2506.07249]] Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages(https://arxiv.org/abs/2506.07249)
Keywords: language model
Abstract: Emerging research on bias attribution and interpretability have revealed how tokens contribute to biased behavior in language models processing English texts. We build on this line of inquiry by adapting the information-theoretic bias attribution score metric for implementation on models handling agglutinative languages, particularly Filipino. We then demonstrate the effectiveness of our adapted method by using it on a purely Filipino model and on three multilingual models: one trained on languages worldwide and two on Southeast Asian data. Our results show that Filipino models are driven towards bias by words pertaining to people, objects, and relationships, entity-based themes that stand in contrast to the action-heavy nature of bias-contributing themes in English (i.e., criminal, sexual, and prosocial behaviors). These findings point to differences in how English and non-English models process inputs linked to sociodemographic groups and bias.
摘要：关于偏见归因和可解释性的新兴研究揭示了代币如何在处理英语文本的语言模型中有偏见的行为。我们通过调整信息理论偏见归因评分度量，以在处理凝集性语言（尤其是菲律宾语）的模型上实施，以此为基础。然后，我们通过在纯粹的菲律宾模型和三种多语言模型上使用改编方法来证明我们的适应方法的有效性：一种对全球语言进行了训练，两种对东南亚数据进行了培训。我们的结果表明，菲律宾模型是通过与人，对象和关系有关的单词，基于实体的主题来朝着偏见的，这些主题与用英语（即犯罪，性和亲社会行为）偏见的主题相反。这些发现表明，英语和非英语模型如何处理与社会人口统计学组和偏见相关的输入的差异。

Title: Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs

Authors: Atahan Özer, Çağatay Yıldız
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07270
Pdf URL: https://arxiv.org/pdf/2506.07270
Copy Paste: [[2506.07270]] Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs(https://arxiv.org/abs/2506.07270)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning thanks to their extensive parametric memory. However, their knowledge is inherently limited by the scope of their pre-training data, while real-world information evolves continuously. Updating this knowledge typically requires costly and brittle re-training, or in-context learning (ICL), which becomes impractical at scale given the volume and volatility of modern information. Motivated by these limitations, we investigate how LLMs perform when exposed to temporal text corpora, or documents that reflect evolving knowledge over time, such as sports biographies where facts like a player's "current team" change year by year. To this end, we introduce two new benchmarks: Temporal Wiki, which captures factual drift across historical Wikipedia snapshots, and Unified Clark, which aggregates timestamped news articles to simulate real-world information accumulation. Our analysis reveals that LLMs often struggle to reconcile conflicting or outdated facts and can be misled when multiple versions of a fact appear in context. To address these issues, we propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training. This knowledge organization strategy enables models to retrieve and reason over temporally filtered, relevant information at inference time. Empirically, our method outperforms ICL and RAG baselines across both benchmarks, especially on questions requiring more complex reasoning or integration of conflicting facts.
摘要：大型语言模型（LLMS）由于其广泛的参数记忆，具有出色的回答和推理功能。但是，他们的知识固有地受到预训练数据的范围的限制，而现实世界的信息则不断发展。更新此知识通常需要昂贵且脆弱的重新训练或内在学习（ICL），鉴于现代信息的数量和波动性，它在大规模上变得不切实际。在这些局限性的驱动下，我们研究了LLM在接触时间文本语料库时的表现，或者反映了随着时间的推移不断发展的知识的文件，例如体育传记，这些事实像玩家的“当前团队”逐年变化。为此，我们介绍了两个新的基准：暂时的Wiki，它捕获了历史悠久的Wikipedia快照和统一Clark的事实漂移，它们汇总了时间戳的新闻文章，以模拟现实世界中的信息积累。我们的分析表明，LLM通常很难调和冲突或过时的事实，并且当事实的多个版本出现在上下文中时可能会被误导。为了解决这些问题，我们提出了一个轻巧的代理框架，该框架可以从源文档中逐步构建一个结构化的外部内存，而无需重新训练。这种知识组织策略使模型能够在推理时检索和推理超过时间过滤的相关信息。从经验上讲，我们的方法在两个基准中都优于ICL和抹布基线，尤其是在需要更复杂的推理或相互矛盾的事实整合的问题上。

Title: Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

Authors: Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, Carlos Gómez-Rodríguez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07274
Pdf URL: https://arxiv.org/pdf/2506.07274
Copy Paste: [[2506.07274]] Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages(https://arxiv.org/abs/2506.07274)
Keywords: language model, llm, prompt
Abstract: Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at this https URL
摘要：代码切换对句法分析提出了一个复杂的挑战，尤其是在稀缺带注释的数据的低资源语言设置中。虽然最近的工作探索了大型语言模型（LLMS）作为序列级别标记的使用，但很少有方法系统地研究了这些模型在代码转换上下文中捕获句法结构的效果。此外，经过单语言库培训的现有解析器通常无法推广到多语言和混合语言输入。为了解决这一差距，我们介绍了Bilingua Parser，这是一种基于LLM的注释管道，旨在为代码转换文本生成通用依赖关系（UD）注释。首先，我们为西班牙语 - 英语和西班牙 - 加人数据开发了一个基于及时的框架，结合了很少的llm llm和专家审查。其次，我们发布了两个带注释的数据集，其中包括第一个西班牙 - 乌拉亚尼（GuaraníUd）份额的语料库。第三，我们对语言对和交流环境之间的开关点进行了详细的句法分析。实验结果表明，Bilingua Parser在专家修订后达到了高达95.29％的LAS，大大优于先前的基线和多语言解析器。这些结果表明，在仔细指导的情况下，LLMS可以用作资源不足，代码切换环境中引导句法资源的实用工具。数据和源代码可在此HTTPS URL上找到

Title: Exploring the Impact of Temperature on Large Language Models:Hot or Cold?

Authors: Lujun Li, Lama Sleem, Niccolo' Gentile, Geoffrey Nichil, Radu State
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07295
Pdf URL: https://arxiv.org/pdf/2506.07295
Copy Paste: [[2506.07295]] Exploring the Impact of Temperature on Large Language Models:Hot or Cold?(https://arxiv.org/abs/2506.07295)
Keywords: language model, llm, prompt
Abstract: The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.
摘要：采样温度是大语言模型（LLMS）中的关键超参数，可修改软件层之前的逻辑，从而重塑输出令牌的分布。最近的研究通过证明LLM能够理解语义而不是仅仅记忆数据，而通过采样温度调节的随机性在模型推论中起着至关重要的作用，从而挑战了随机鹦鹉的类比。在这项研究中，我们系统地评估了0到2范围内温度对旨在评估六种不同功能的数据集的影响，对三种不同尺寸的开源模型进行了统计分析：小（1b-----------4b），培养基（6b------13b），大型（40b----------------------------------------------）。我们的发现揭示了温度对模型性能的不同技能特异性影响，突出了实际应用中最佳温度选择的复杂性。为了应对这一挑战，我们提出了一个基于BERT的温度选择器，该温度选择器利用这些观察到的效果来确定给定提示的最佳温度。我们证明，这种方法可以显着提高超级插图数据集中中小型模型的性能。此外，我们的研究扩展到FP16的精度推断，表明温度效应与4位量化模型中观察到的效应一致。通过评估三个量化模型中的温度效应高达4.0，我们发现突变温度（发生重大性能变化的点）随着模型大小而增加。

Title: ConfQA: Answer Only If You Are Confident

Authors: Yin Huang, Yifan Ethan Xu, Kai Sun, Vera Yan, Alicia Sun, Haidar Khan, Jimmy Nguyen, Mohammad Kachuee, Zhaojiang Lin, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07309
Pdf URL: https://arxiv.org/pdf/2506.07309
Copy Paste: [[2506.07309]] ConfQA: Answer Only If You Are Confident(https://arxiv.org/abs/2506.07309)
Keywords: language model, llm, hallucination, prompt
Abstract: Can we teach Large Language Models (LLMs) to refrain from hallucinating factual statements? In this paper we present a fine-tuning strategy that we call ConfQA, which can reduce hallucination rate from 20-40% to under 5% across multiple factuality benchmarks. The core idea is simple: when the LLM answers a question correctly, it is trained to continue with the answer; otherwise, it is trained to admit "I am unsure". But there are two key factors that make the training highly effective. First, we introduce a dampening prompt "answer only if you are confident" to explicitly guide the behavior, without which hallucination remains high as 15%-25%. Second, we leverage simple factual statements, specifically attribute values from knowledge graphs, to help LLMs calibrate the confidence, resulting in robust generalization across domains and question types. Building on this insight, we propose the Dual Neural Knowledge framework, which seamlessly select between internally parameterized neural knowledge and externally recorded symbolic knowledge based on ConfQA's confidence. The framework enables potential accuracy gains to beyond 95%, while reducing unnecessary external retrievals by over 30%.
摘要：我们可以教大型语言模型（LLM）避免幻觉的事实陈述吗？在本文中，我们提出了一种微调策略，我们称之为ConfQA，在多个事实基准中，幻觉率可以从20-40％降低到5％以下。核心想法很简单：当LLM正确回答问题时，经过培训可以继续答案；否则，接受训练可以承认“我不确定”。但是，有两个关键因素使培训高效。首先，我们引入了一个令人沮丧的及时及时“仅当您有信心的情况下”来明确指导这种行为，没有幻觉仍然高达15％-25％。其次，我们利用简单的事实语句，特别是从知识图中归因的值，以帮助LLMS校准置信度，从而在范围内和问题类型之间进行稳健的概括。在此洞察力的基础上，我们提出了双重神经知识框架，该框架在内部参数化的神经知识和基于ConfQA的信心之间的外部参数化的象征知识之间进行了无缝选择。该框架使潜在的准确性提高到95％以上，同时将不必要的外部检索减少了30％以上。

Title: Reward Model Interpretability via Optimal and Pessimal Tokens

Authors: Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07326
Pdf URL: https://arxiv.org/pdf/2506.07326
Copy Paste: [[2506.07326]] Reward Model Interpretability via Optimal and Pessimal Tokens(https://arxiv.org/abs/2506.07326)
Keywords: language model, prompt
Abstract: Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
摘要：奖励建模已成为将大语言模型与人类价值观对齐的关键组成部分。重大关注的重点是使用奖励模型作为微调生成模型的手段。但是，奖励模型本身（通过将及时响应对通过变成标量奖励来直接编码人类价值判断的奖励模型本身 - 仍然相对研究。我们提出了一种新颖的方法，可以通过对整个词汇空间中的响应进行详尽的分析来奖励模型可解释性。 By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent令牌。我们在最近的十个不同参数计数和体系结构的十个开源奖励模型中演示了这些效果。我们的结果挑战了关于奖励模型的互换性的假设，以及它们作为复杂和上下文依赖人类价值观的代理。我们发现，这些模型可以编码有关某些身份群体的偏见，这可能是无害训练的意外后果 - 扭曲风险通过现在部署到数百万的下游大语模型传播的风险。

Title: Improving LLM Reasoning through Interpretable Role-Playing Steering

Authors: Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07335
Pdf URL: https://arxiv.org/pdf/2506.07335
Copy Paste: [[2506.07335]] Improving LLM Reasoning through Interpretable Role-Playing Steering(https://arxiv.org/abs/2506.07335)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model's residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.
摘要：角色扮演已成为增强大语言模型（LLM）推理能力的有效技术。但是，现有方法主要依赖于及时的工程，而工程通常缺乏稳定性和解释性。在本文中，我们介绍了稀疏的自动编码器角色扮演转向（SRP），该转向（SRPS）是一个新颖的框架，识别并操纵与角色扮演行为相关的内部模型特征。我们的方法从角色扮演提示中提取潜在表示，根据激活模式选择最相关的功能，并构建一个可以将其注入具有可控强度的模型残留流中的转向向量。我们的方法可以对特定角色的行为进行细粒度的控制，并提供有关角色信息如何影响内部模型激活的见解。各种推理基准和模型尺寸的广泛实验表明了性能的一致性。值得注意的是，在零射链（COT）设置中，CSQA上Llama3.1-8b的准确性从31.86％提高到39.80％，而SVAMP的Gemma2-9b从37.50％提高到45.10％。这些结果突出了SRP提高LLM中推理能力的潜力，与传统的基于及时的角色扮演相比，提供了更好的解释性和稳定性。

Title: Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation

Authors: Seokil Ham, Yubin Choi, Seungju Cho, Yujin Yang, Younghun Kim, Changick Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07356
Pdf URL: https://arxiv.org/pdf/2506.07356
Copy Paste: [[2506.07356]] Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation(https://arxiv.org/abs/2506.07356)
Keywords: language model, llm, prompt
Abstract: Recently, major AI service providers such as Google and OpenAI have introduced Finetuning-as-a-Service, which enables users to customize Large Language Models (LLMs) for specific downstream tasks using their own data. However, this service is vulnerable to degradation of LLM safety-alignment when user data contains harmful prompts. While some prior works address this issue, fundamentally filtering harmful data from user data remains unexplored. Motivated by our observation that a directional representation reflecting refusal behavior (called the refusal feature) obtained from safety-aligned LLMs can inherently distinguish between harmful and harmless prompts, we propose the Refusal-Feature-guided Teacher (ReFT). Our ReFT model is trained to identify harmful prompts based on the similarity between input prompt features and its refusal feature. During finetuning, the ReFT model serves as a teacher that filters harmful prompts from user data and distills alignment knowledge into the base model. Extensive experiments demonstrate that our ReFT-based finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in Finetuning-as-a-Service.
摘要：最近，诸如Google和OpenAI之类的主要AI服务提供商推出了Finetuning-As-A-Service，该服务提供了使用自己的数据，使用户可以使用自己的数据自定义大型语言模型（LLMS）。但是，当用户数据包含有害的提示时，此服务很容易受到LLM安全分组的降解。尽管一些先前的工作解决了这个问题，但从根本上讲，从用户数据中过滤有害数据仍未探索。我们的观察是，我们的观察是反映了从安全成立的LLMS获得的拒绝行为（称为拒绝特征）的方向表示，可以固有地区分有害和无害提示，我们建议拒绝 - 特征引导的教师（REFT）。我们的REFT模型经过训练，可以根据输入提示功能及其拒绝功能之间的相似性来识别有害的提示。在填充过程中，REFT模型是一名教师，从用户数据中过滤有害提示并将对齐知识提高到基本模型中。广泛的实验表明，我们基于REFT的固定策略有效地最大程度地减少了有害产出，并提高了特定用户特定任务的明迹准确性，从而提供了一种实用解决方案，可用于在finetuning-a-Service中安全可靠地部署LLMS。

Title: Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models

Authors: Kyeonghyun Kim, Jinhee Jang, Juhwan Choi, Yoonji Lee, Kyohoon Jin, YoungBin Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07424
Pdf URL: https://arxiv.org/pdf/2506.07424
Copy Paste: [[2506.07424]] Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models(https://arxiv.org/abs/2506.07424)
Keywords: language model, llm
Abstract: Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel framework that combines the strengths of both LLMs and SLMs to achieve high performance while maintaining efficiency. PiFi integrates a single frozen layer from an LLM into a SLM and fine-tunes the combined model for specific tasks, boosting performance without a significant increase in computational cost. We show that PiFi delivers consistent performance improvements across a range of natural language processing tasks, including both natural language understanding and generation. Moreover, our findings demonstrate PiFi's ability to effectively leverage LLM knowledge, enhancing generalization to unseen domains and facilitating the transfer of linguistic abilities.
摘要：大型语言模型（LLM）以其广泛的语言知识和强大的概括能力而闻名，但是它们的高计算需求使其不适合资源受限的环境。相比之下，小语言模型（SLM）是计算上有效的，但通常缺乏LLM的广泛概括能力。为了弥合这一差距，我们提出了PIFI，这是一个新颖的框架，结合了LLM和SLM的优势，以达到高性能的同时保持效率。 PIFI将一个从LLM的单个冷冻层集成到SLM中，并微调组合模型以进行特定任务，从而提高了性能，而没有显着增加计算成本。我们表明，PIFI在一系列自然语言处理任务（包括自然语言理解和产生）中提供一致的性能提高。此外，我们的发现表明了PIFI有效利用LLM知识的能力，增强对看不见领域的概括并促进语言能力的传递。

Title: Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding

Authors: Feifan Song, Shaohang Wei, Wen Luo, Yuxuan Fan, Tianyu Liu, Guoyin Wang, Houfeng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07434
Pdf URL: https://arxiv.org/pdf/2506.07434
Copy Paste: [[2506.07434]] Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding(https://arxiv.org/abs/2506.07434)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have been popular, while still facing challenges in obtaining both high-quality and aligned content. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models by the guidance of a small aligned model. The small model first drafts well-aligned beginnings, followed by the large base model to continue the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks, termed as the alignment tax. Extensive experiments are further conducted to examine the impact of different settings and time efficiency, as well as analyses on the intrinsic mechanisms of WSD in depth.
摘要：大型语言模型（LLM）需要与人类偏好保持一致，以避免产生令人反感，虚假或毫无意义的内容。最近，LLM对准的低资源方法很受欢迎，同时仍然面临着获得高质量和一致性内容的挑战。通过观察到产生对齐响应的困难集中在解码开始时的动机，我们提出了一个新颖的框架，即弱到较强的解码（WSD），以通过小型对齐模型的指导来增强基本模型的对齐能力。小型模型初稿良好的起点，然后是大型基本模型，以继续其余的自动切换机制控制。我们还收集了一个新的数据集，即GeneralIgn，以微调小型Pilot-3B作为草案模型，该模型有效地增强了WSD框架下的不同基本模型，以胜过所有基线方法，同时避免在下游任务上降级（称为对准税）。进一步进行了广泛的实验，以检查不同设置和时间效率的影响，并对WSD深度的内在机制进行了分析。

Title: LG-ANNA-Embedding technical report

Authors: Jooyoung Choi, Hyun Kim, Hansol Jang, Changwook Jun, Kyunghoon Bae, Hyewon Choi, Stanley Jungkyu Choi, Honglak Lee, Chulmin Yun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07438
Pdf URL: https://arxiv.org/pdf/2506.07438
Copy Paste: [[2506.07438]] LG-ANNA-Embedding technical report(https://arxiv.org/abs/2506.07438)
Keywords: language model, prompt
Abstract: This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.
摘要：本报告提出了一个基于统一的指令框架，用于学习针对信息检索（IR）和非IR任务优化的通用文本嵌入。我们的方法建立在仅解码器的大型语言模型（Mistral-7b）的基础上，结合了内在的学习，软监督和自适应硬性阴性挖掘，以生成无需特定于任务的微调的上下文感知的嵌入。结构化指令和很少的示例用于指导跨不同任务的模型，从而在分类，语义相似性，聚类和重新掌握基准方面具有强大的性能。为了改善语义歧视，我们采用了一个软标签框架，在该框架中，从高性能密集的猎犬和reranker中提取的连续相关性得分是细粒度的监督信号。此外，我们引入了基于自适应的硬性挖掘，该开采基于与积极例子的相似性过滤了语义上模棱两可的负面因素，从而增强了训练稳定性和检索稳健性。我们的模型对新引入的MTEB（英语，V2）基准进行了评估，涵盖了七个类别的41个任务。结果表明，我们的方法实现了强大的概括，并通过Borda评分在最佳模型中排名，表现优于几个较大或完全微调的基线。这些发现突出了将上下文提示，软监督和自适应采样相结合的有效性，以进行可扩展的高质量嵌入生成。

Title: KScope: A Framework for Characterizing the Knowledge Status of Language Models

Authors: Yuxin Xiao, Shan Chen, Jack Gallifant, Danielle Bitterman, Thomas Hartvigsen, Marzyeh Ghassemi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07458
Pdf URL: https://arxiv.org/pdf/2506.07458
Copy Paste: [[2506.07458]] KScope: A Framework for Characterizing the Knowledge Status of Language Models(https://arxiv.org/abs/2506.07458)
Keywords: language model, llm
Abstract: Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.
摘要：表征大型语言模型（LLM）对给定问题的知识是具有挑战性的。结果，先前的工作主要检查了知识冲突下的LLM行为，其中模型的内部参数内存与外部环境中的信息相矛盾。但是，这并不能完全反映模型对问题的答案的了解程度。在本文中，我们首先根据LLM知识模式的一致性和正确性介绍了五个知识状态的分类学。然后，我们提出了KScope，这是统计检验的层次结构框架，逐步完善有关知识模式的假设，并将LLM知识表征为这五个状态之一。我们将KScope应用于四个数据集中的9个LLM，并系统地建立：（1）支持上下文狭窄模型的知识差距。（2）与困难，相关性和熟悉度有关的上下文功能推动了知识更新。（3）当部分正确或冲突时，LLMS表现出相似的特征偏好，但在持续错误时会急剧分歧。（4）通过我们的功能分析限制的上下文摘要以及增强的信誉，进一步提高了更新的效力和跨LLM的概括。

Title: From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered

Authors: Siddartha Devic, Tejas Srinivasan, Jesse Thomason, Willie Neiswanger, Vatsal Sharan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07461
Pdf URL: https://arxiv.org/pdf/2506.07461
Copy Paste: [[2506.07461]] From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered(https://arxiv.org/abs/2506.07461)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community's progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.
摘要：大型语言模型（LLM）越来越多地为现实世界中的用户提供帮助，但其可靠性仍然是一个令人关注的问题。不确定性量化（UQ）已被预示为通过使用户知道何时信任LLM预测来增强人类合作的工具。我们认为，在LLM中，当前的不确定性量化实践对于为在现实世界任务中做出决策的人类用户开发有用的UQ并不是最佳的。通过对40个LLM UQ方法的分析，我们确定了三种普遍的做法，阻碍了社区的进步，以使下游用户受益的目标：1）评估生态有效性低的基准； 2）仅考虑认知不确定性； 3）优化不一定指示下游实用程序的指标。对于每个问题，我们提出了LLM UQ研究人员应考虑的具体用户中心的实践和研究方向。我们认为，与其使用不完美的指标对非代表性任务进行攀爬，不如说社区应采用以人为本的LLM不确定性量化方法。

Title: CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Authors: Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07463
Pdf URL: https://arxiv.org/pdf/2506.07463
Copy Paste: [[2506.07463]] CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models(https://arxiv.org/abs/2506.07463)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly $35$ TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a $5.2$ TB carefully curated Chinese web corpus, a $22.5$ TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. So, we propose a novel pipeline justifying data quality mainly based on models through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract $4.5$ billion pieces of CoT(Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained in CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora.
摘要：我们介绍了CCI4.0，这是一种大型双语预训练数据集，该数据集设计为出色的数据质量和类似人类的推理轨迹。 CCI4.0大约占有$ 35 $ TB的磁盘空间，包括两个子数据库：CCI4.0-M2-碱和CCI4.0-M2-COT。 CCI4.0-M2-键合并了$ 5.2 $ TB精心策划的中国网络语料库，Nemotron-CC的$ 22.5 $ TB英语子集，以及来自Math，Wiki，Arxiv和Code的不同来源。尽管这些数据主要来自处理良好的数据集，但各个领域的质量标准是动态的，需要丰富的专家经验和劳动来处理。因此，我们提出了一条新的管道，主要通过模型，通过两阶段重复数据删除，多分类器质量评分和域感知的流利度过滤来证明数据质量的合理性。我们提取了45亿美元的COT（经营链）模板，称为CCI4.0-M2-COT。与大型模型的蒸馏不同，我们提出的分阶段cot提取体现了各种推理模式，并显着降低了幻觉的可能性。经验评估表明，在CCI4.0中预先培训的LLM受益于清洁剂，更可靠的培训信号，从而在下游任务方面持续改进，尤其是在数学和代码反射任务中。我们的结果强调了严格的数据策展和人类思维模板在推进LLM绩效方面的关键作用，从而阐明了自动处理验证的语料库。

Title: Improving Fairness of Large Language Models in Multi-document Summarization

Authors: Haoyuan Li Yusen Zhang, Snigdha Chaturvedi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07479
Pdf URL: https://arxiv.org/pdf/2506.07479
Copy Paste: [[2506.07479]] Improving Fairness of Large Language Models in Multi-document Summarization(https://arxiv.org/abs/2506.07479)
Keywords: language model
Abstract: Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at this https URL.
摘要：多文章摘要（MDS）的公平性对于在具有不同社会属性价值的文档中提供全面观点至关重要，这可能会对决策产生重大影响。例如，一个倾向于过度表达产品的负面评论的汇总系统会误导客户无视好产品。先前的作品在两个层面上测量了MDS的公平性：摘要级别和语料库级别。尽管摘要级别的公平性集中在单个摘要上，但语料库级的公平侧重于摘要的语料库。最近的方法主要集中于摘要级别的公平性。我们提出了Fairpo，这是一种偏好调整方法，侧重于MDS中的摘要级别和语料库级公平。为了提高摘要级别的公平性，我们建议通过扰动文档集生成偏好对。为了提高语料库级别的公平性，我们建议通过动态调整偏好对的权重来提出公平感知的偏好调整。我们的实验表明，Fairpo在保持摘要的关键质量的同时，胜过强大的基线。该代码可在此HTTPS URL上找到。

Title: A Hybrid GA LLM Framework for Structured Task Optimization

Authors: Berry Feng, Jonas Lin, Patrick Lau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07483
Pdf URL: https://arxiv.org/pdf/2506.07483
Copy Paste: [[2506.07483]] A Hybrid GA LLM Framework for Structured Task Optimization(https://arxiv.org/abs/2506.07483)
Keywords: language model, llm
Abstract: GA LLM is a hybrid framework that combines Genetic Algorithms with Large Language Models to handle structured generation tasks under strict constraints. Each output, such as a plan or report, is treated as a gene, and evolutionary operations like selection, crossover, and mutation are guided by the language model to iteratively improve solutions. The language model provides domain knowledge and creative variation, while the genetic algorithm ensures structural integrity and global optimization. GA LLM has proven effective in tasks such as itinerary planning, academic outlining, and business reporting, consistently producing well structured and requirement satisfying results. Its modular design also makes it easy to adapt to new tasks. Compared to using a language model alone, GA LLM achieves better constraint satisfaction and higher quality solutions by combining the strengths of both components.
摘要：GA LLM是一个混合框架，将遗传算法与大语言模型相结合，以在严格的约束下处理结构化的生成任务。每个输出（例如计划或报告）都被视为一个基因，而诸如选择，交叉和突变之类的进化操作都以语言模型为指导，以改善解决方案。语言模型提供了领域知识和创造性变化，而遗传算法则确保结构完整性和全球优化。 GA LLM已被证明在诸如行程计划，学术概述和业务报告等任务中有效，并始终产生结构良好的需求满足结果。它的模块化设计还使得适应新任务。与仅使用语言模型相比，GA LLM通过结合两个组件的优势来实现更好的约束满意度和更高质量的解决方案。

Title: DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech

Authors: Haotian Guo, Jing Han, Yongfeng Tu, Shihao Gao, Shengfan Shen, Wulong Xiang, Weihao Gan, Zixing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07502
Pdf URL: https://arxiv.org/pdf/2506.07502
Copy Paste: [[2506.07502]] DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech(https://arxiv.org/abs/2506.07502)
Keywords: language model
Abstract: Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns-pronunciation, pause, stress and intonation-can help resolve textual ambiguity and reveal a speaker's true intent. DEBATE contains 1,001 carefully selected ambiguous utterances, each recorded by 10 native speakers, capturing diverse linguistic ambiguities and their disambiguation through speech. We detail the data collection pipeline and provide rigorous quality analysis. Additionally, we benchmark three state-of-the-art large speech and audio-language models, illustrating clear and huge performance gaps between machine and human understanding of spoken intent. DEBATE represents the first effort of its kind and offers a foundation for building similar DTS datasets across languages and cultures. The dataset and associated code are available at: this https URL.
摘要：尽管对文本和视觉上的歧义进行了广泛的研究，但通过语音（DTS）的歧义仍未得到充实。这在很大程度上是由于缺乏高质量的数据集，这些数据集将句子与含糊不清的文字配对。为了解决这一差距，我们提出了辩论，这是一个独特的中国公共语音文本数据集，旨在研究语音提示和模式如何预言，暂停，压力和语调范围，有助于解决文本歧义并揭示说话者的真实意图。辩论包含1,001个精心选择的歧义话语，每个话语都是由10位母语者记录的，通过言论捕捉了多样化的语言歧义及其歧义。我们详细说明数据收集管道并提供严格的质量分析。此外，我们基准了三个最先进的大型语音和音频语言模型，这说明了机器和人类对口语意图之间的明确而巨大的性能差距。辩论代表着同类的第一个努力，并为跨语言和文化构建类似的DTS数据集奠定了基础。数据集和关联的代码可在以下网址提供：此HTTPS URL。

Title: Towards Large Language Models with Self-Consistent Natural Language Explanations

Authors: Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07523
Pdf URL: https://arxiv.org/pdf/2506.07523
Copy Paste: [[2506.07523]] Towards Large Language Models with Self-Consistent Natural Language Explanations(https://arxiv.org/abs/2506.07523)
Keywords: language model, llm
Abstract: Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.
摘要：大型语言模型（LLM）似乎为解释性提供了简单的途径：只需要求他们解释自己的决定即可。然而，研究表明，这些事后解释常常歪曲了真实的决策过程，这在特征重要性方面揭示了。尽管越来越多的证据表明这种不一致，但没有出现系统的解决方案，部分原因是估计特征重要性的成本很高，这将评估限制在小数据集中。为了解决这个问题，我们介绍了事后自称库（PSCB） - 涵盖各种任务和模型的决策的大规模基准，每种决策都与LLM生成的解释和相应的功能重要性得分配对。对PSCB的分析表明，正确和错误的预测之间的自洽得分几乎没有差异。我们还表明，标准指标无法有意义区分解释。为了克服这一限制，我们提出了一个替代指标，可以更有效地捕获解释质量的变化。我们将其用于通过直接偏好优化（DPO）进行微调LLM，从而导致解释和与决策相关的特征之间的比对明显更好，甚至在域移动下也是如此。我们的发现表明，通往更可信赖，自以为是的LLM的可扩展道路。

Title: Bit-level BPE: Below the byte boundary

Authors: Sangwhan Moon, Tatsuya Hiraoka, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07541
Pdf URL: https://arxiv.org/pdf/2506.07541
Copy Paste: [[2506.07541]] Bit-level BPE: Below the byte boundary(https://arxiv.org/abs/2506.07541)
Keywords: language model
Abstract: Byte-level fallbacks for subword tokenization have become a common practice in large language models. In particular, it has been demonstrated to be incredibly effective as a pragmatic solution for preventing OOV, especially in the context of larger models. However, breaking a character down to individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
摘要：字节级后备用于子字令牌化已成为大语言模型中的一种常见实践。特别是，它已被证明是防止OOV的实用解决方案非常有效的，尤其是在较大模型的背景下。但是，将角色划分为个体字节会显着增加诸如中文，日语和韩语（CJK）等语言的长尾令牌的序列长度以及其他角色多样性上下文，例如表情符号。序列长度的增加导致训练和推理期间的计算更长。在这项工作中，我们提出了一种简单的压缩技术，该技术可无损地降低序列长度。

Title: SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition

Authors: Mengsong Wu, Di Zhang, Yuqiang Li, Dongzhan Zhou, Wenliang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07557
Pdf URL: https://arxiv.org/pdf/2506.07557
Copy Paste: [[2506.07557]] SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition(https://arxiv.org/abs/2506.07557)
Keywords: language model, llm, hallucination
Abstract: While Large Language Models (LLMs) have achieved remarkable success in a wide range of applications, their performance often degrades in complex reasoning tasks. In this work, we introduce SELT (Self-Evaluation LLM Tree Search), a novel framework that leverages a modified Monte Carlo Tree Search (MCTS) to enhance LLM reasoning without relying on external reward models. By redefining the Upper Confidence Bound scoring to align with intrinsic self-evaluation capabilities of LLMs and decomposing the inference process into atomic subtasks augmented with semantic clustering at each node, SELT effectively balances exploration and exploitation, reduces redundant reasoning paths, and mitigates hallucination. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools, where SELT achieves significant improvements in answer accuracy and reasoning robustness compared to baseline methods. Notably, our framework operates without task-specific fine-tuning, demonstrating strong generalizability across diverse reasoning tasks. Relevant results and code are available at this https URL .
摘要：尽管大型语言模型（LLMS）在广泛的应用中取得了巨大的成功，但它们的性能经常在复杂的推理任务中降低。在这项工作中，我们介绍了SELT（自我评估LLM Tree搜索），这是一个新颖的框架，利用修改后的蒙特卡洛树搜索（MCT）在不依赖外部奖励模型的情况下增强LLM推理。通过重新定义LLM的固有自我评估能力并将推理过程分解为在每个节点的语义聚类增强的原子子任务中，SELT有效平衡探索和剥削，减少了冗余的推理途径，并减少了幻觉。我们验证了我们在挑战基准方面的方法，包括基于知识的MMLU和工具学习数据集密封工具，与基线方法相比，SELT在答案的准确性和鲁棒性方面取得了重大提高。值得注意的是，我们的框架在没有特定于任务的微调的情况下运行，这表明了各种推理任务的强大可推广性。相关结果和代码可在此HTTPS URL上找到。

Title: Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models

Authors: Ramakrishna Appicharla, Baban Gain, Santanu Pal, Asif Ekbal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07583
Pdf URL: https://arxiv.org/pdf/2506.07583
Copy Paste: [[2506.07583]] Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models(https://arxiv.org/abs/2506.07583)
Keywords: language model, gpt, llm, prompt, chat, agent
Abstract: Despite the popularity of the large language models (LLMs), their application to machine translation is relatively underexplored, especially in context-aware settings. This work presents a literature review of context-aware translation with LLMs. The existing works utilise prompting and fine-tuning approaches, with few focusing on automatic post-editing and creating translation agents for context-aware machine translation. We observed that the commercial LLMs (such as ChatGPT and Tower LLM) achieved better results than the open-source LLMs (such as Llama and Bloom LLMs), and prompt-based approaches serve as good baselines to assess the quality of translations. Finally, we present some interesting future directions to explore.
摘要：尽管大语言模型（LLMS）很受欢迎，但它们在机器翻译中的应用相对却相对不受欢迎，尤其是在上下文感知的设置中。这项工作介绍了与LLM的上下文感知翻译的文献综述。现有作品利用提示和微调方法，很少有人专注于自动编辑和创建用于上下文感知机器翻译的翻译代理。我们观察到，商业LLM（例如ChatGpt和Tower LLM）比开源LLM（例如Llama和Bloom LLM）获得了更好的结果，并且基于及时的方法是评估翻译质量的良好基础。最后，我们提出了一些有趣的未来探索方向。

Title: Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Authors: Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero, Aimar Zabala, Ekhi Azurmendi, German Rigau, Eneko Agirre, Mikel Artetxe, Aitor Soroa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07597
Pdf URL: https://arxiv.org/pdf/2506.07597
Copy Paste: [[2506.07597]] Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque(https://arxiv.org/abs/2506.07597)
Keywords: language model, llm
Abstract: Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model, and improved results when scaling up. Using Llama 3.1 instruct 70B as backbone our model comes near frontier models of much larger sizes for Basque, without using any Basque data apart from the 1.2B word corpora. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
摘要：具有用户意图的语言模型需要大量的指令数据集，这仅适用于有限的语言。在本文中，我们探讨了低资源场景中常规教学适应管道的替代方案。我们假设低资源语言的现实场景，其中只有以下可用的情况：目标语言，现有的开放重量多语言基础和指示的骨干LLM，以及从指示的骨干链采样的合成生成的指令。我们提供了一组巴斯克克的实验集，该实验系统地研究了这些组件的不同组合，这些组件对1,680名参与者的基准和人类偏好进行了评估。我们的结论表明，目标语言语言是必不可少的，其合成指令产生了可靠的模型，最重要的是，用作骨架使用指令调节的模型使用基本非教学模型优于表现，并在扩展时改进结果。使用Llama 3.1指示70B作为骨架，我们的模型靠近Basque大尺寸的Frontier型号，而无需使用1.2B Word Corpora以外的任何Basque数据。我们发布代码，模型，指令数据集和人类偏好，以支持对低资源语言适应的未来研究中的完全可重复性。

Title: PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels

Authors: Peyman Rostami, Vahid Rahimzadeh, Ali Adibi, Azadeh Shakery
Subjects: cs.CL, cs.AI, cs.IR, cs.SI
Abstract URL: https://arxiv.org/abs/2506.07606
Pdf URL: https://arxiv.org/pdf/2506.07606
Copy Paste: [[2506.07606]] PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels(https://arxiv.org/abs/2506.07606)
Keywords: language model, llm
Abstract: Stance detection identifies the viewpoint expressed in text toward a specific target, such as a political figure. While previous datasets have focused primarily on tweet-level stances from established platforms, user-level stance resources, especially on emerging platforms like Bluesky remain scarce. User-level stance detection provides a more holistic view by considering a user's complete posting history rather than isolated posts. We present the first stance detection dataset for the 2024 U.S. presidential election, collected from Bluesky and centered on Kamala Harris and Donald Trump. The dataset comprises 16,044 user-target stance pairs enriched with engagement metadata, interaction graphs, and user posting histories. PolitiSky24 was created using a carefully evaluated pipeline combining advanced information retrieval and large language models, which generates stance labels with supporting rationales and text spans for transparency. The labeling approach achieves 81\% accuracy with scalable LLMs. This resource addresses gaps in political stance analysis through its timeliness, open-data nature, and user-level perspective. The dataset is available at this https URL
摘要：立场检测确定了文本中对特定目标（例如政治人物）所表达的观点。虽然以前的数据集主要集中在已建立平台的推文级别上，但用户级的立场资源，尤其是在蓝军等新兴平台上仍然很少。用户级别的立场检测通过考虑用户完整的发布历史记录而不是孤立的帖子来提供更全面的视图。我们介绍了2024年美国总统大选的第一个立场检测数据集，该数据集是从蓝军收集的，以卡马拉·哈里斯和唐纳德·特朗普为中心。该数据集包含16,044个用户目标姿势对，具有互动元数据，交互图和用户发布历史记录。 Politisky24是使用经过精心评估的管道创建的，该管道结合了高级信息检索和大型语言模型，该模型与支持原理和文本跨度生成了姿态标签，以达到透明度。标签方法通过可扩展的LLM达到81 \％精度。该资源通过其及时性，开放数据和用户级别的观点来解决政治立场分析中的差距。该数据集可在此HTTPS URL上找到

Title: Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation

Authors: Roman Kyslyi, Yuliia Maksymiuk, Ihor Pysmennyi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07617
Pdf URL: https://arxiv.org/pdf/2506.07617
Copy Paste: [[2506.07617]] Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation(https://arxiv.org/abs/2506.07617)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: In this paper we introduce the first effort to adapt large language models (LLMs) to the Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We have fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small(7B) finetuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: this https URL
摘要：在本文中，我们介绍了将大型语言模型（LLM）调整到乌克兰方言（在我们的情况下）的第一个努力，这是在喀尔巴阡高地中使用的低资源和形态上复杂的方言。我们创建了一个平行的9852言语语料库，乌克兰句子对和7320个方言映射的词典。我们还通过提出高级检索增强生成（RAG）管道来生成合成平行翻译对，以52142个示例扩展语料库来解决数据短缺。我们使用LORA进行了微调的多个开源LLM，并在标准转换任务上对其进行了评估，还与很少的GPT-4O翻译进行了比较。在没有人类注释者的情况下，我们采用了将BLEU，CHRF ++，TER和LLM基于LLM的判断（GPT-4O）结合的多项式评估策略（GPT-4O）。结果表明，即使是较小的（7b）芬特模型，在自动和LLM评估的指标上，诸如GPT-4O之类的零弹药基线的表现都优于零弹药基线。所有数据，模型和代码均在以下公开发布：此HTTPS URL

Title: LoRMA: Low-Rank Multiplicative Adaptation for LLMs

Authors: Harsh Bihany, Shubham Patel, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07621
Pdf URL: https://arxiv.org/pdf/2506.07621
Copy Paste: [[2506.07621]] LoRMA: Low-Rank Multiplicative Adaptation for LLMs(https://arxiv.org/abs/2506.07621)
Keywords: language model, llm
Abstract: Large Language Models have shown remarkable capabilities in the NLP domain. Their effectiveness can mainly be attributed to their ability to adapt to an array of downstream tasks. However, generally, full fine-tuning is a computationally expensive job. To mitigate this, many techniques have been developed that prime efficiency, a prominent one being Low-Rank Adaptation (LoRA). However, LoRA and its variants employ re-parametrized additive updates. In this paper, we propose Low-Rank Multiplicative Adaptation (LoRMA), which shifts the paradigm of additive updates to a richer space of matrix multiplicative transformations. We tackle challenges such as computational complexity and rank bottleneck of matrix multiplication by effectively re-ordering operations and introducing rank inflation strategies. We conduct extensive experiments to demonstrate the effectiveness of our approach in terms of various evaluation metrics.
摘要：大型语言模型在NLP域显示出了显着的功能。它们的有效性主要归因于它们适应一系列下游任务的能力。但是，通常，完整的微调是一项计算昂贵的工作。为了减轻这种情况，已经开发出许多技术效率，这是一种突出的效率，这是一种低级适应（Lora）。但是，洛拉及其变体采用重新构造的添加剂更新。在本文中，我们提出了低级乘法适应性（LORMA），该适应性将添加剂更新的范式转移到矩阵乘法转换的较丰富空间。我们通过有效地重新排序操作并引入等级通货膨胀策略来应对计算复杂性和矩阵乘法等级瓶颈等挑战。我们进行了广泛的实验，以证明我们的方法在各种评估指标方面的有效性。

Title: Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation

Authors: Kseniia Petukhova, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07626
Pdf URL: https://arxiv.org/pdf/2506.07626
Copy Paste: [[2506.07626]] Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation(https://arxiv.org/abs/2506.07626)
Keywords: language model, llm
Abstract: Large language models (LLMs) hold great promise for educational applications, particularly in intelligent tutoring systems. However, effective tutoring requires alignment with pedagogical strategies - something current LLMs lack without task-specific adaptation. In this work, we explore whether fine-grained annotation of teacher intents can improve the quality of LLM-generated tutoring responses. We focus on MathDial, a dialog dataset for math instruction, and apply an automated annotation framework to re-annotate a portion of the dataset using a detailed taxonomy of eleven pedagogical intents. We then fine-tune an LLM using these new annotations and compare its performance to models trained on the original four-category taxonomy. Both automatic and qualitative evaluations show that the fine-grained model produces more pedagogically aligned and effective responses. Our findings highlight the value of intent specificity for controlled text generation in educational settings, and we release our annotated data and code to facilitate further research.
摘要：大型语言模型（LLM）对教育应用具有巨大的希望，尤其是在智能辅导系统中。但是，有效的辅导需要与教学策略保持一致 - 目前的LLMS缺乏而没有特定任务的适应性。在这项工作中，我们探讨了对教师意图的细粒度注释是否可以提高LLM生成的辅导反应的质量。我们专注于数学数据集，用于数学指令的对话框数据集，并应用一个自动注释框架，使用11个教学意图的详细分类法重新通知数据集的一部分。然后，我们使用这些新注释微调了LLM，并将其性能与对原始四类分类法培训的模型进行比较。自动评估和定性评估都表明，细颗粒模型会产生更教学上的一致性和有效响应。我们的发现突出了在教育环境中对受控文本生成的意图特异性的价值，我们发布了带注释的数据和代码以促进进一步的研究。

Title: Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

Authors: Brian Gordon, Yonatan Bitton, Andreea Marzoca, Yasumasa Onoe, Xiao Wang, Daniel Cohen-Or, Idan Szpektor
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07631
Pdf URL: https://arxiv.org/pdf/2506.07631
Copy Paste: [[2506.07631]] Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline(https://arxiv.org/abs/2506.07631)
Keywords: language model, llm
Abstract: Large Vision-Language Models (VLMs) now generate highly detailed, paragraphlength image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: this https URL
摘要：现在，大型视觉模型（VLM）生成了高度详细的段落图像标题，但是评估其事实准确性仍然具有挑战性。当前的方法通常会错过细粒度的错误，该错误是为较短的文本或缺少具有经过验证的不准确性的数据集而设计的。我们介绍了Docci Critique，这是一种基准，具有1,400 VLM生成的段落字幕（100张图像，14张VLM），其范围内的错误和解释性理由的10,216多个句子级人类注释，均在段落中。在此基础上，我们开发了Vnli-Critique，这是一种自动句子级别的事实分类和批评的模型。我们重点介绍了三个关键应用：（1）vnli-Critique证明了强大的概括，并通过最先进的M-Haldetect基准表现出了验证，并在巧克力要求验证中进行了强劲的结果。（2）Docci-Critique的VNLi-Critique驱动的Autorater提供了可靠的VLM排名，显示出与人类事实判断的极好的一致性（例如，Spearman 0.98）。（3）一种创新的批评家和革命管道，其中VNLI-Critique指南基于LLM的校正进行了批评，可以实现字幕事实的实质性改善（例如，详细信息Capaps-4870的46％增长）。我们的工作提供了至关重要的基准和实用工具，旨在显着提高标准，以进行细粒度评估并促进VLM图像理解的改善。项目页面：此HTTPS URL

Title: TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review

Authors: Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Zhijiang Guo, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07642
Pdf URL: https://arxiv.org/pdf/2506.07642
Copy Paste: [[2506.07642]] TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review(https://arxiv.org/abs/2506.07642)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at this https URL.
摘要：尽管大型语言模型（LLMS）在协助同行评审方面表现出了巨大的潜力，但当前的方法通常很难在保持效率的同时进行透彻和有见地的评论。在本文中，我们提出了TreeReview，这是一个新颖的框架，将纸质评论建模为层次结构和双向提问过程。 TreeReReview首先通过将高级问题递归分解为细粒度的子问题来构造一棵评论问题，然后通过迭代地汇总从LEAF到ROOT的答案以获取最终审查来解决问题树。至关重要的是，我们结合了动态问题的扩展机制，以在需要时通过产生后续问题进行更深入的探测。我们构建了一个源自ICLR和Neurips场地的基准测试，以评估我们的完整审查生成和可操作的反馈评论生成任务的方法。基于LLM的和人类评估的实验结果表明，TreeReview在提供全面，深入和专家一致的审查反馈方面的表现优于强大的基准，而与计算密集型方法相比，LLM令牌的使用率最高为80％。我们的代码和基准数据集可在此HTTPS URL上找到。

Title: Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Authors: Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07645
Pdf URL: https://arxiv.org/pdf/2506.07645
Copy Paste: [[2506.07645]] Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models(https://arxiv.org/abs/2506.07645)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but safety-related training data contains mainly high-resource languages like English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be cheaply created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how it can be extended to other languages. We release the created datasets and code for further research.
摘要：近年来，大型语言模型（LLMS）在各种自然语言处理（NLP）任务中表现出了令人印象深刻的功能。但是，它们对越狱和扰动的敏感性需要进行其他评估。许多LLM是多语言的，但与安全有关的培训数据主要包含高资源的语言，例如英语。这可能会使他们容易受到低资源语言（例如波兰语）的扰动。我们展示了通过仅更改几个字符并使用小型代理模型来计算单词重要性计算，可以廉价地产生强烈的攻击。我们发现这些角色和单词级攻击会大大改变不同LLM的预测，这表明可以使用该漏洞来规避其内部安全机制。我们验证了对波兰语（低资源语言）的攻击构建方法，并在此语言中找到LLM的潜在漏洞。此外，我们展示了如何将其扩展到其他语言。我们发布创建的数据集和代码以进行进一步研究。

Title: Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation

Authors: Rui Hu, Xiaolong Lin, Jiawang Liu, Shixi Huang, Zhenpeng Zhan
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.07646
Pdf URL: https://arxiv.org/pdf/2506.07646
Copy Paste: [[2506.07646]] Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation(https://arxiv.org/abs/2506.07646)
Keywords: prompt
Abstract: In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. Our approach involves fine-tuning a large-scale pre-trained automatic speech recognition (ASR) model, conditioned on ground truth transcripts, to simultaneously output phrase-level graphemes and annotation labels. To further correct errors in phonemic labeling, we employ a decoding strategy that utilizes dictionary prior knowledge. The objective evaluation results demonstrate that our proposed method outperforms previous approaches relying solely on text or audio. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.
摘要：在本文中，我们提出了一种在给定的音频转录对上注释音素和韵律标签的方法，该方法旨在构建日语文本到语音（TTS）数据集。我们的方法涉及对基于地面真相转录的条件的大规模预训练的自动语音识别（ASR）模型，以同时输出短语级别的素描和注释标签。为了进一步纠正语音标签中的错误，我们采用了使用词典先验知识的解码策略。客观评估结果表明，我们提出的方法优于以前的方法，仅依赖文本或音频。主观评估结果表明，使用我们的方法注释的标签训练的TTS模型合成的语音自然性与经过手动注释训练的模型相当。

Title: Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping

Authors: Nitin Sharma, Thomas Wolfers, Çağatay Yıldız
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07658
Pdf URL: https://arxiv.org/pdf/2506.07658
Copy Paste: [[2506.07658]] Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping(https://arxiv.org/abs/2506.07658)
Keywords: language model, gpt, llm, prompt
Abstract: The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using TF and Term TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark strongly correlates with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to domain knowledge evaluation in base models during training for early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigate catastrophic forgetting.
摘要：本文解决了语言模型（LM）评估中的两个关键挑战：创建可靠的域特异性基准并了解域适应过程中的知识表示。我们介绍了一条确定的管道，将原始领域语料库转换为完成型基准，而无需依赖LMS或人类策划，从而消除了基准污染问题，同时可以对最新的域数据进行评估。我们的方法使用TF和术语TF-IDF方法和构建迅速目标对生成域特异性关键字和相关单词列表。我们通过测量其使用正确的域特异性目标完成这些提示的能力来评估模型，从而直接评估域知识的计算成本低。通过跨多个模型（GPT-2中/XL，Llama-2/3.1，Olmo-2，Qwen-2，Mismtral）和域的全面实验，我们证明我们的基准分析与专家生成的基准分析密切相关，同时提供了比传统的引发性验证度量更准确的域知识衡量标准。我们揭示了域的适应性在较小的模型（在500个步骤内）迅速发生，并说明了在培训期间基本模型中一种新的域知识评估方法。通过将机械分析扩展到域的适应性，我们发现初始到中的层主要负责属性提取，而后来的层则集中在近代的标记预测上。此外，我们表明，在适应过程中，忘记开始于中间层，属性提取发生并在以后的层中放大。我们的工作既可以为特定领域的LMS提供一种实用的评估方法，又提供了对适应过程中知识表示的新见解，对更有效的微调策略和有针对性的方法有影响，以减轻灾难性遗忘。

Title: Synthesis by Design: Controlled Data Generation via Structural Guidance

Authors: Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07664
Pdf URL: https://arxiv.org/pdf/2506.07664
Copy Paste: [[2506.07664]] Synthesis by Design: Controlled Data Generation via Structural Guidance(https://arxiv.org/abs/2506.07664)
Keywords: llm
Abstract: Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities.
摘要：由于复杂的逻辑和精确计算的需求，数学推理对于LLM的数学推理仍然具有挑战性。现有方法通过通过问题重新介绍综合数据集来增强LLM推理，但面临发电质量和问题复杂性的问题。为了解决这个问题，我们建议从数学推理中提取结构信息，并通过结构化解决方案指导数据生成。应用于数学和GSM8K，我们的方法通过标记的中间步骤和6.1k个问题的基准产生39K问题。基准上的结果表明，随着推理长度的增加，模型性能下降。此外，我们使用有关LLM范围的拟议培训数据进行了微调实验，结果验证了我们数据集的有效性。我们希望拟议的方法和数据集将有助于提高LLM推理能力的未来研究。

Title: Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

Authors: Prarabdh Shukla, Wei Yin Chong, Yash Patel, Brennan Schaffner, Danish Pruthi, Arjun Bhagoji
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07667
Pdf URL: https://arxiv.org/pdf/2506.07667
Copy Paste: [[2506.07667]] Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch(https://arxiv.org/abs/2506.07667)
Keywords: chat
Abstract: To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement($\textit{e.g.}$, users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch's automated moderation tool ($\texttt{AutoMod}$) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch's APIs to send over $107,000$ comments collated from $4$ datasets. We measure $\texttt{AutoMod}$'s accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to $94\%$ on some datasets, $\textit{bypass moderation}$. Contextual addition of slurs to these messages results in $100\%$ removal, revealing $\texttt{AutoMod}$'s reliance on slurs as a moderation signal. We also find that contrary to Twitch's community guidelines, $\texttt{AutoMod}$ blocks up to $89.5\%$ of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in $\texttt{AutoMod}$'s capabilities and underscores the importance for such systems to understand context effectively.
摘要：为了满足内容审核的需求，在线平台已诉诸自动化系统。较新的实时参与形式（$ \ textit {e.g。} $，在Twitch等平台上发表评论的用户在Twitch等平台上发表了此类调节系统期望的潜伏期的其他压力。尽管率很高，但对这些系统的有效性知之甚少。在本文中，我们对Twitch的自动审核工具（$ \ texttt {Automod} $）进行了审核，以调查其在标记可恨内容的有效性。对于我们的审核，我们创建流媒体帐户以充当孤立的测试床，并使用Twitch的API与实时聊天接口，以发送超过$ 107,000的评论，从$ 4 $数据集中整理。我们测量$ \ texttt {automod} $在公然仇恨的内容上的准确性，其中包含厌女症，种族主义，能力主义和同性恋恐惧症。我们的实验表明，某些数据集中的大量可恨消息，最高$ 94 \％$，$ \ textit {bypass Mederation} $。在这些消息中上下文添加诽谤会导致$ 100 \％$删除，从而揭示了$ \ texttt {automod} $对slurs的依赖作为节制信号。我们还发现，与Twitch的社区准则相反，$ \ texttt {automod} $块最高$ 89.5 \％$ $ $ $ $ $ $ $ $ $ $ $ $的良性示例，这些示例在教学或授权的环境中使用敏感单词。总体而言，我们的审计指出了$ \ texttt {automod} $的功能的较大差距，并强调了此类系统有效理解上下文的重要性。

Title: GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

Authors: Ionut-Teodor Sorodoc, Leonardo F. R. Ribeiro, Rexhina Blloshmi, Christopher Davis, Adrià de Gispert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07671
Pdf URL: https://arxiv.org/pdf/2506.07671
Copy Paste: [[2506.07671]] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation(https://arxiv.org/abs/2506.07671)
Keywords: llm
Abstract: We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM's ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F1 in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.
摘要：我们展示了车库，这是一个大型的抹布基准，并具有每个接地通道的人体策划的长格式答案和注释，从而可以对LLM的精细评估在产生抹布答案时是否可以识别相关的接地。我们的基准包含2366个有关复杂性，活力和主题的问题，其中包括从私人文档集和Web中检索到的35k注释的段落，以反映现实世界中的抹布用例。这使其成为评估LLM仅识别构成响应所需信息的能力的理想测试床，或者在信息不足时提供偏转响应。对车库上多个最先进的LLM的评估表明，这些模型往往会过度夏季化，而不是（a）严格地将它们的答案基于带注释的相关段落（最多达到相关性的60％的分数），或（b）在没有相关基础的情况下（最多可获得31％真实的正速率）偏向。 F1归因于相关来源的最多为58.9％，我们表明，在回答时间敏感的问题以及必须从稀疏的私人接地来源中获取知识时，性能尤其降低。

Title: Training Superior Sparse Autoencoders for Instruct Models

Authors: Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07691
Pdf URL: https://arxiv.org/pdf/2506.07691
Copy Paste: [[2506.07691]] Training Superior Sparse Autoencoders for Instruct Models(https://arxiv.org/abs/2506.07691)
Keywords: language model, llm
Abstract: As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose $\underline{\textbf{F}}$inetuning-$\underline{\textbf{a}}$ligned $\underline{\textbf{S}}$equential $\underline{\textbf{T}}$raining ($\textit{FAST}$), a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, $\textit{FAST}$ achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features, for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT(P)}$ and $\textit{BT(F)}$. Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at this https URL.
摘要：随着大型语言模型（LLM）的规模和能力的增长，了解其内部机制变得越来越关键。稀疏的自动编码器（SAE）已成为机械解释性的关键工具，从而使人类可解开功能从LLM中提取。但是，现有的SAE培训方法主要是为基本模型设计的，当应用于指导模型时，重建质量和解释性降低。要弥合这一间隙，我们建议$ \下划线{\ textbf {f}} $ intuning- $ \ usewissline {\ textbf {a}} $ ligned $ \ usepline {\ textbf {s s}} $ equential $ equital $ quines $ \ extialline {专门针对指导模型量身定制的方法。 $ \ textIt {fast} $将培训过程与指导模型的数据分布和激活模式一致，从而使重建和特征可解释性的重大改进。在qwen2.5-7b-instruct上，$ \ textit {fast} $在令牌重建中达到了0.6468的平方误差，大大优于5.1985和1.5096的基线方法。在功能可解释性中，$ \ textit {fast} $可产生更高比例的高质量功能，对于llama3.2-3b-Instruct，$ 21.1 \％$ $在顶部得分，而$ 7.0 \％\％$ $ $ $ 10.2 \％$ $ \％$ for $ \ textit {bt（bt（p）} $ and $ \ textit and textit（bt textit and textit and textIt and textit and textit and textit and} $。令人惊讶的是，我们发现通过SAE介入特殊令牌的激活会导致产出质量的提高，这为对模型行为的细粒度控制提供了新的机会。该HTTPS URL可用代码，数据和240个训练有素的SAE。

Title: Through the Valley: Path to Effective Long CoT Training for Small Language Models

Authors: Renjie Luo, Jiaxi Li, Chen Huang, Wei Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07712
Pdf URL: https://arxiv.org/pdf/2506.07712
Copy Paste: [[2506.07712]] Through the Valley: Path to Effective Long CoT Training for Small Language Models(https://arxiv.org/abs/2506.07712)
Keywords: language model, chain-of-thought
Abstract: Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impacts downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
摘要：长期的经营链（COT）的监督已成为增强语言模型推理的共同策略。虽然对大型模型有效，但我们确定了一种现象，我们称之为长COT降解，其中在有限的长COT数据经历的小语言模型（SLMS; <= 3B参数）上经历了严重的性能恶化。通过对Qwen2.5，Llama3和Gemma3家族进行的广泛实验，我们证明了这种降解在SLM中广泛存在。在某些情况下，仅在8K长的婴儿床示例中接受培训的模型在微调之前损失了其原始性能的75％。令人惊讶的是，我们进一步观察到，对于某些特别小型的模型，即使在220k长的婴儿床示例上进行训练也无法在微调之前恢复或超过其原始性能。我们的分析将这种影响归因于错误积累：虽然较长的响应增加了多步推理的能力，但它们也扩大了复杂错误的风险。此外，我们发现长长的COT退化可能会对下游增强学习（RL）产生负面影响，尽管这可以通过足够缩放的监督微调（SFT）来缓解。我们的发现挑战了关于长床培训对SLM的好处的共同假设，并为建立更有效的小规模推理模型提供了实用的指导。

Title: Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU

Authors: Vincenzo Timmel, Manfred Vogel, Daniel Perruchoud, Reza Kakooee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07726
Pdf URL: https://arxiv.org/pdf/2506.07726
Copy Paste: [[2506.07726]] Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU(https://arxiv.org/abs/2506.07726)
Keywords: gpt, llm
Abstract: This paper presents a new long-form release of the Swiss Parliaments Corpus, converting entire multi-hour Swiss German debate sessions (each aligned with the official session protocols) into high-quality speech-text pairs. Our pipeline starts by transcribing all session audio into Standard German using Whisper Large-v3 under high-compute settings. We then apply a two-step GPT-4o correction process: first, GPT-4o ingests the raw Whisper output alongside the official protocols to refine misrecognitions, mainly named entities. Second, a separate GPT-4o pass evaluates each refined segment for semantic completeness. We filter out any segments whose Predicted BLEU score (derived from Whisper's average token log-probability) and GPT-4o evaluation score fall below a certain threshold. The final corpus contains 801 hours of audio, of which 751 hours pass our quality control. Compared to the original sentence-level SPC release, our long-form dataset achieves a 6-point BLEU improvement, demonstrating the power of combining robust ASR, LLM-based correction, and data-driven filtering for low-resource, domain-specific speech corpora.
摘要：本文介绍了瑞士议会语料库的新长期释放，将整个多小时的德国辩论会议（每次与正式会议协议保持一致）转换为高质量的语音文本对。我们的管道首先使用在高计算设置下使用Whisper大V3将所有会话音频转录为标准德语。然后，我们应用了两步的GPT-4O校正过程：首先，GPT-4O与官方协议一起摄入原始的耳语输出，以完善错误认识，主要命名为实体。其次，单独的GPT-4O通行证评估每个精制段的语义完整性。我们滤除了预测BLEU得分（源自Whisper的平均令牌对数概率）和GPT-4O评估得分的所有细分市场低于一定阈值。最终语料库包含801小时的音频，其中751小时通过了我们的质量控制。与原始的句子级SPC版本相比，我们的长格式数据集取得了6点BLEU的改进，证明了结合强大的ASR，基于LLM的校正和数据驱动的低资源，特定域语音Corpor的功能。

Title: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

Authors: Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe
Subjects: cs.CL, cs.AI, cs.SC
Abstract URL: https://arxiv.org/abs/2506.07751
Pdf URL: https://arxiv.org/pdf/2506.07751
Copy Paste: [[2506.07751]] Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking(https://arxiv.org/abs/2506.07751)
Keywords: language model, llm
Abstract: Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in their reasoning. I.e., they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In contrast, our approach focuses on "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. We find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstraL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.
摘要：最近的研究表明，大型语言模型（LLM），尤其是较小的语言模型，通常缺乏推理的坚固性。也就是说，当面对分配变化时，例如数值或名义变量的变化或分散注意力从句的插入时，它们倾向于体验性能下降。解决此问题的可能策略涉及生成合成数据，以进一步“实例化”潜在变化的推理问题。相反，我们的方法着重于“抽象”推理问题。这不仅有助于抵消分配变化，而且还有助于与符号工具的连接来推导解决方案。我们发现，通过强化学习（RL）更好地获取了这个抽象过程，而不是只有监督的微调，而细微调整通常无法产生忠实的抽象。我们的方法是拔出的方法 - 使用RL在粒度抽象数据上促进LLM中的抽象推理 - 可显着减轻最近GSM扰动基准的性能降解。

Title: LLM Unlearning Should Be Form-Independent

Authors: Xiaotian Ye, Mengqi Zhang, Shu Wu
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07795
Pdf URL: https://arxiv.org/pdf/2506.07795
Copy Paste: [[2506.07795]] LLM Unlearning Should Be Form-Independent(https://arxiv.org/abs/2506.07795)
Keywords: language model, llm
Abstract: Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model's perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.
摘要：大型语言模型（LLM）旨在消除或抑制模型中的不良知识，从而有希望控制有害或私人信息以防止滥用。但是，最近的研究强调了其在实际情况下的有限效力，从而阻碍了实际采用。在这项研究中，我们确定了许多下游失败的根源的普遍问题：现有的未学习方法的有效性在很大程度上取决于训练样本的形式，并且经常无法推广到相同知识的替代表达。我们正式将此问题表征为依赖形式的偏差，并系统地研究其在各种下游任务中的特定表现模式。为了量化其流行率和支持未来的研究，我们引入了ORT，这是一种新颖的基准测试，旨在评估未学习方法的鲁棒性，以防止知识表达的变化。结果表明，在当前技术中，依赖形式的偏见既广泛又严重。我们认为，LLM Uncorning应该与形式无关，以解决在现实世界中批判性安全方案中遇到的下游任务的无尽形式。为了实现这一目标，我们介绍了一种新颖的无培训方法，作为一种有前途的解决方案路径，我们引入了Rank-One Concept Redirection（ROCR）。 ROCR通过针对下游任务，特别是激活的危险概念来定位不变式来执行不学习。它能够在几秒钟内修改模型参数，以将模型对特定学习目标概念的感知重定向到另一个无害概念。广泛的实验表明，与传统方法相比，ROCR在产生高自然产出的同时显着提高了未学习效率。

Title: WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Authors: Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07818
Pdf URL: https://arxiv.org/pdf/2506.07818
Copy Paste: [[2506.07818]] WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code(https://arxiv.org/abs/2506.07818)
Keywords: language model, llm
Abstract: With the rapid advancement of Generative AI technology, Multimodal Large Language Models(MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming,WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weakness that models encountered during the development process.
摘要：随着生成AI技术的快速发展，多模式大语言模型（MLLM）有可能充当能够执行复杂Web应用程序开发的AI软件工程师。考虑到该模型需要多维次级可容纳的融合以应对各种开发阶段的挑战，因此构建多视图评估框架对于准确指导提高发展效率至关重要。但是，现有的基准通常无法评估子可容量，而仅专注于网页生成结果。在这项工作中，我们从软件工程原理中汲取灵感，并进一步提出Webuibench，该基准是一种基准，该基准旨在评估四个关键领域的MLLM：WebUI感知，HTML编程，WebUI-HTML理解和WebUI-to-to-to-to-ode。 Webuibench包括21K高质量的提问对，源自超过0.7K现实世界网站。对29个主流MLLM的广泛评估发现了模型在开发过程中遇到的技能特征和各种弱点。

Title: Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning

Authors: Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07851
Pdf URL: https://arxiv.org/pdf/2506.07851
Copy Paste: [[2506.07851]] Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning(https://arxiv.org/abs/2506.07851)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still falls behind the pace. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model's attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model's capacity to infer authentic causal instruction-response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF) leveraging intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student's attention with the teacher's focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning and code generation benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.
摘要：大型语言模型（LLM）在上下文理解方面已显示出显着改善。但是，他们在长期文化推理和一代中参与真正关键信息的能力仍然落后于速度。具体而言，我们的初步实验表明，某些分心模式会在推断过程中误导该模型的注意力，并且消除这些模式可以大大提高推理的准确性和发电质量。我们将这种现象归因于训练数据中的虚假相关性，这阻碍了模型推断真实的因果指导 - 响应关系的能力。这种现象可能引起冗余的推理过程，可能导致明显的推理开销，更重要的是产生错误或次优响应。为了减轻这种情况，我们引入了一个两阶段的框架，称为“学习集中精力（LEAF）利用基于干预的推理，对脱离混杂因素。在第一阶段，LEAF与高级教师基于基于梯度的比较自动根据培训语料库中的因果关系来自动识别混杂令牌。然后，在第二阶段，它会在蒸馏过程中修剪这些令牌，以制定干预，使学生的注意力与教师的重点分布在真正的关键背景令牌上。实验结果表明，LEAF不仅在各种数学推理和代码生成基准中实现了绝对改善，而且还可以有效地抑制推理过程中对混淆令牌的关注，从而产生了更加可解释和可靠的推理模型。

Title: MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Authors: Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07899
Pdf URL: https://arxiv.org/pdf/2506.07899
Copy Paste: [[2506.07899]] MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs(https://arxiv.org/abs/2506.07899)
Keywords: language model, llm, hallucination, prompt
Abstract: Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably - without retraining or forgetting previous information - remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks across LLaMA-3 and Mistral demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
摘要：现实世界中部署的语言模型通常需要事后更新，以结合新知识或更正的知识。但是，有效，可靠地编辑此类模型 - 而不必重新培训或忘记以前的信息 - 仍然是一个重大挑战。现有的终身模型编辑方法折衷概括，干扰过去的编辑，或者无法扩展到长期编辑序列。我们提出了回忆录，这是一种新颖的可扩展框架，该框架通过残留记忆（即专用参数模块）注入知识，同时保留了预训练模型的核心功能。通过通过样本依赖性掩码来稀疏输入激活，回忆录将每个编辑局限于内存参数的不同子集，从而最大程度地减少编辑之间的干扰。在推断时，它通过将新查询的稀疏激活模式与编辑过程中存储的稀疏激活模式进行比较来确定相关的编辑。这使概括能够通过仅激活相关知识来改写查询，同时抑制不必要的记忆激活无关的提示。关于回答，幻觉校正和跨分布式概括的实验，跨骆驼-3和米斯特拉尔的实验表明，回忆录在可靠性，概括和局部指标上实现了最先进的性能，并以最小的遗忘范围扩展到数千个顺序编辑。

Title: MiniCPM4: Ultra-Efficient LLMs on End Devices

Authors: MiniCPM Team: Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07900
Pdf URL: https://arxiv.org/pdf/2506.07900
Copy Paste: [[2506.07900]] MiniCPM4: Ultra-Efficient LLMs on End Devices(https://arxiv.org/abs/2506.07900)
Keywords: language model, llm, chat
Abstract: This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and data-efficient tenary LLM, BitCPM. Regarding inference systems, we propose this http URL that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.
摘要：本文介绍了MinicPM4，这是一种高效的大型语言模型（LLM），该模型明确设计为端侧设备。我们通过在四个关键方面的系统创新来实现这一效率：模型架构，培训数据，培训算法和推理系统。具体而言，就模型架构而言，我们提出了INFLLM V2，这是一种可训练的稀疏注意机制，可加速预填充和解码阶段，以进行长篇文化处理。关于培训数据，我们提出了UltraClean，这是一种有效，准确的预训练数据过滤和生成策略，以及Ultrachat V2，这是一个全面的监督微调数据集。这些数据集使仅使用8万亿个培训令牌可以实现令人满意的模型性能。关于培训算法，我们提出了模型Tunnel V2，以进行有效的训练前策略搜索，并通过引入块构成块的推出，以提高质量平衡的加固学习和数据有效的终止LLM，BITCPM。关于推理系统，我们提出了该HTTP URL，该HTTP URL集成了稀疏的注意力，模型量化和投机抽样，以实现有效的预填充和解码。为了满足不同的设备要求，MinicPM4分别有两个版本，分别为0.5B和8B参数。足够的评估结果表明，MiniCPM4的表现优于多个基准的开源模型，其尺寸相似，强调了其效率和有效性。值得注意的是，在处理长序列时，MinicPM4-8B在QWEN3-8B上表现出显着的速度提高。通过进一步的适应，MiniCPM4成功地为各种应用程序提供了动力，包括值得信赖的调查生成和与模型上下文协议一起使用的工具使用，清楚地展示了其广泛的可用性。

Title: Quantum Graph Transformer for NLP Sentiment Classification

Authors: Shamminuj Aktar, Andreas Bärtschi, Abdel-Hameed A. Badawy, Stephan Eidenbenz
Subjects: cs.CL, quant-ph
Abstract URL: https://arxiv.org/abs/2506.07937
Pdf URL: https://arxiv.org/pdf/2506.07937
Copy Paste: [[2506.07937]] Quantum Graph Transformer for NLP Sentiment Classification(https://arxiv.org/abs/2506.07937)
Keywords: language model
Abstract: Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves higher or comparable accuracy than existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.
摘要：量子机器学习是建立更高效和表现力的模型的有前途的方向，尤其是在理解复杂，结构化数据至关重要的领域。我们提出了量子图变压器（QGT），这是一种基于混合图的体系结构，将量子自我注意的机制集成到结构化语言建模的消息通话框架中。使用参数化的量子电路（PQC）实现了注意机制，这使该模型能够捕获丰富的上下文关系，同时与经典的注意机制相比，可以显着减少可训练参数的数量。我们评估五个情感分类基准测试的QGT。实验结果表明，与现有的量子自然语言处理（QNLP）模型相比，QGT始终达到更高或可比的精度，包括基于注意力的方法和基于非注意的方法。与等效的经典图形变压器相比，QGT在现实世界数据集的平均准确度提高了5.42％，合成数据集的平均准确度提高了4.76％。此外，QGT证明了提高的样品效率，要求标记的样品少近50％才能达到Yelp数据集的可比性能。这些结果突出了基于图的QNLP技术的潜力，以提高高效且可扩展的语言理解。

Title: Statistical Hypothesis Testing for Auditing Robustness in Language Models

Authors: Paulius Rauba, Qiyao Wei, Mihaela van der Schaar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07947
Pdf URL: https://arxiv.org/pdf/2506.07947
Copy Paste: [[2506.07947]] Statistical Hypothesis Testing for Auditing Robustness in Language Models(https://arxiv.org/abs/2506.07947)
Keywords: language model, llm
Abstract: Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or changing the model variant. We cannot simply compare two LLM outputs since they might differ due to the stochastic nature of the system, nor can we compare the entire output distribution due to computational intractability. While existing methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework is (i) model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values; (iv) supports multiple perturbations via controlled error rates; and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.
摘要：考虑测试大语言模型（LLM）系统的输出在任意干预下（例如输入扰动还是更改模型变体）的问题。我们不能简单地比较两个LLM输出，因为由于系统的随机性质，它们可能会有所不同，因此由于计算棘手的性能，我们也无法比较整个输出分布。尽管存在分析基于文本的产出的现有方法，但它们集中在根本上不同的问题上，例如测量偏见或公平性。为此，我们引入了基于分布的扰动分析，该框架将LLM扰动分析重新定义为常见的假设检验问题。我们通过蒙特卡洛采样在低维语义相似性空间内构建经验无效和替代输出分布，从而实现了可拖动的推断而无需限制性分布假设。该框架是（i）模型 - 静态的，（ii）支持对任何黑盒LLM上任意输入扰动的评估，（iii）产生可解释的p值；（iv）通过受控错误率支持多个扰动；（v）提供标量效应大小。我们在多个案例研究中证明了该框架的有用性，显示了我们如何量化响应变化，测量真/假阳性速率并评估参考模型的一致性。最重要的是，我们将其视为LLM审核的可靠的常见假设测试框架。

Title: Language Models over Canonical Byte-Pair Encodings

Authors: Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian DuSell, Benjamin LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O'Donnell, Ryan Cotterell
Subjects: cs.CL, cs.FL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07956
Pdf URL: https://arxiv.org/pdf/2506.07956
Copy Paste: [[2506.07956]] Language Models over Canonical Byte-Pair Encodings(https://arxiv.org/abs/2506.07956)
Keywords: language model
Abstract: Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string -- these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
摘要：现代语言模型代表字符字符串上的概率分布，作为通过确定性代币器（例如字节对编码）得出的（较短）的令牌字符串。虽然这种方法在将语言模型扩展到大型语料库方面非常有效，但其当前的化身具有关注的属性：该模型分配了$ \ it {noncaronical} $ doken编码的每个字符串的$ \ it $ suppersibal质量质量，这些字符串的编码是不可能训练的。大的）。这种不当分配既是错误的，因为非规范的弦从未出现在训练数据中，并且浪费，将概率质量从合理的产出中转移出来。这些是可避免的错误！在这项工作中，我们提出了在令牌级的语言模型中执行规范性的方法，以确保只有规范的令牌字符串被分配了积极的概率。我们提出了两种方法：（1）规范性通过调节，利用未经其他培训的测试时间推理策略，以及（2）通过施工的规范性，该模型参数化保证了规范的输出，但需要培训。我们证明，修复规范性错误可以提高多种模型和语料库中持有数据的可能性。

Title: Correlated Errors in Large Language Models

Authors: Elliot Kim, Avi Garg, Kenny Peng, Nikhil Garg
Subjects: cs.CL, cs.AI, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2506.07962
Pdf URL: https://arxiv.org/pdf/2506.07962
Copy Paste: [[2506.07962]] Correlated Errors in Large Language Models(https://arxiv.org/abs/2506.07962)
Keywords: language model, llm
Abstract: Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
摘要：假定培训数据，体系结构和提供商的多样性可以减轻LLM的同质性。但是，我们缺乏关于不同LLM是否有意义差异的经验证据。我们使用两个受欢迎的排行榜和一项简历筛选任务对总体350多个LLM进行了大规模的经验评估。我们发现模型错误中的实质性相关性 - 在一个排行榜数据集上，模型在两个模型错误时都同意60％的时间。我们确定驱动模型相关性的因素，包括共享的体系结构和提供商。然而，至关重要的是，即使与不同的架构和提供商，更大，更准确的模型也具有高度相关的错误。最后，我们在两个下游任务中显示了相关性的影响：LLM-As-As-Gudge评估和招聘 - 后者反映了有关算法单栽培的理论预测。

Title: Reinforcement Pre-Training

Authors: Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.08007
Pdf URL: https://arxiv.org/pdf/2506.08007
Copy Paste: [[2506.08007]] Reinforcement Pre-Training(https://arxiv.org/abs/2506.08007)
Keywords: language model
Abstract: In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
摘要：在这项工作中，我们将强化预训练（RPT）作为大型语言模型和强化学习（RL）的新缩放范式（RPT）。具体来说，我们将下一步预测重新构建为使用RL训练的推理任务，在该任务中，它可以在其中获得可验证的奖励，以正确预测给定上下文的下一代币。 RPT提供了一种可扩展的方法来利用大量文本数据作为通用RL，而不是依靠特定于域的注释答案。通过激励下一言推理的能力，RPT显着提高了预测下一代币的语言建模准确性。此外，RPT为进一步加强进行微调提供了强大的预训练基础。缩放曲线表明，增加的训练始终如一地提高了下一步的预测准确性。结果将RPT定位为有效且有希望的缩放范式，以提高语言模型预训练。