2025-06-12

Title: LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Authors: Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09147
Pdf URL: https://arxiv.org/pdf/2506.09147
Copy Paste: [[2506.09147]] LLM-as-a-qualitative-judge: automating error analysis in natural language generation(https://arxiv.org/abs/2506.09147)
Keywords: language model, llm, prompt
Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at this https URL.
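The cumulative clustering step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, assuming issues are processed one at a time and each either joins an existing issue type or opens a new one. The abstract describes an LLM-based approach; here a hypothetical word-overlap function (`jaccard`) stands in for that judgment, and every name and the threshold value are illustrative rather than taken from the authors' code.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for an LLM judgment (assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def cumulative_cluster(
    issues: List[str],
    similarity: Callable[[str, str], float] = jaccard,
    threshold: float = 0.3,
) -> Dict[str, List[str]]:
    """Process issues one by one: join the best-matching cluster or start a new one."""
    clusters: Dict[str, List[str]] = {}  # cluster label -> member issues
    for issue in issues:
        best_label, best_score = None, 0.0
        for label in clusters:
            score = similarity(issue, label)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            clusters[best_label].append(issue)
        else:
            # The first issue of a new cluster doubles as the cluster label.
            clusters[issue] = [issue]
    return clusters


def report(clusters: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Issue types with their counts, most frequent first."""
    pairs = [(label, len(members)) for label, members in clusters.items()]
    return sorted(pairs, key=lambda pair: -pair[1])


if __name__ == "__main__":
    demo = [
        "answer is in the wrong language",
        "answer in wrong language",
        "retrieved documents are irrelevant",
        "irrelevant retrieved documents",
    ]
    for label, count in report(cumulative_cluster(demo)):
        print(count, label)
```

The result depends on processing order, since earlier issues define the cluster labels that later issues are compared against.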

Title: PHRASED: Phrase Dictionary Biasing for Speech Translation

Authors: Peidong Wang, Jian Xue, Rui Zhao, Junkun Chen, Aswin Shanmugam Subramanian, Jinyu Li
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.09175
Pdf URL: https://arxiv.org/pdf/2506.09175
Copy Paste: [[2506.09175]] PHRASED: Phrase Dictionary Biasing for Speech Translation(https://arxiv.org/abs/2506.09175)
Keywords: language model
Abstract: Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by 21% relatively for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving 85% relative improvement in phrase recall.

Title: Extrapolation by Association: Length Generalization Transfer in Transformers

Authors: Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09251
Pdf URL: https://arxiv.org/pdf/2506.09251
Copy Paste: [[2506.09251]] Extrapolation by Association: Length Generalization Transfer in Transformers(https://arxiv.org/abs/2506.09251)
Keywords: language model
Abstract: Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization--the ability to extrapolate from shorter to longer inputs--through the lens of task association. We find that length generalization can be transferred across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.

Title: Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat

Authors: Zhuofang Li, Rafal Kocielnik, Fereshteh Soltani, Penphob (Andrea) Boonyarungsrit, Animashree Anandkumar, R. Michael Alvarez
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.09259
Pdf URL: https://arxiv.org/pdf/2506.09259
Copy Paste: [[2506.09259]] Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat(https://arxiv.org/abs/2506.09259)
Keywords: chat
Abstract: Millions of players engage daily in competitive online games, communicating through in-game chat. Prior research has focused on detecting relatively small volumes of toxic content using various Natural Language Processing (NLP) techniques for the purpose of moderation. However, recent studies emphasize the importance of detecting prosocial communication, which can be as crucial as identifying toxic interactions. Recognizing prosocial behavior allows for its analysis, rewarding, and promotion. In contrast to toxicity detection, few datasets, models, and resources exist for identifying prosocial behaviors in game-chat text. In this work, we employ unsupervised discovery combined with game domain expert collaboration to identify and categorize prosocial player behaviors from game chat. We further propose a novel Self-Anchored Attention Model (SAAM) which gives 7.9% improvement compared to the best existing technique. The approach utilizes the entire training set as "anchors" to help improve model performance under the scarcity of training data. This approach led to the development of the first automated system for classifying prosocial behaviors in in-game chats, particularly given the low-resource settings where large-scale labeled data is not available. Our methodology was applied to one of the most popular online gaming titles - Call of Duty(R): Modern Warfare(R) II, showcasing its effectiveness. This research is novel in applying NLP techniques to discover and classify prosocial behaviors in player in-game chat communication. It can help shift the focus of moderation from solely penalizing toxicity to actively encouraging positive interactions on online platforms.

Title: Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Authors: Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09277
Pdf URL: https://arxiv.org/pdf/2506.09277
Copy Paste: [[2506.09277]] Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models(https://arxiv.org/abs/2506.09277)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated the capability to generate free-text self Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

Title: $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding

Authors: Cesare Spinoso-Di Piano, David Austin, Pablo Piantanida, Jackie Chi Kit Cheung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09301
Pdf URL: https://arxiv.org/pdf/2506.09301
Copy Paste: [[2506.09301]] $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding(https://arxiv.org/abs/2506.09301)
Keywords: llm
Abstract: Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, resulting in utterances where the literal and the intended meanings do not match. The Rational Speech Act (RSA) framework, which explicitly models speaker intentions, is the most widespread theory of probabilistic pragmatics, but existing implementations are either unable to account for figurative expressions or require modeling the implicit motivations for using figurative language (e.g., to express joy or annoyance) in a setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware RSA $(RSA)^2$ framework which models figurative language use by considering a speaker's employed rhetorical strategy. We show that $(RSA)^2$ enables human-compatible interpretations of non-literal utterances without modeling a speaker's motivations for being non-literal. Combined with LLMs, it achieves state-of-the-art performance on the ironic split of PragMega+, a new irony interpretation dataset introduced in this study.
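For readers unfamiliar with the Rational Speech Act framework that this paper builds on, the standard recursion (literal listener, pragmatic speaker, pragmatic listener) can be sketched as below. This is vanilla RSA on a textbook scalar-implicature toy, not the rhetorical-strategy-aware variant the paper proposes. The lexicon, the uniform prior, the rationality parameter `alpha`, and the omission of utterance costs are all assumptions chosen for illustration.

```python
from typing import Dict, List

MEANINGS: List[str] = ["some_but_not_all", "all"]
UTTERANCES: List[str] = ["some", "all"]

# Literal semantics: 1.0 if the utterance is literally true of the meaning.
LITERAL: Dict[str, Dict[str, float]] = {
    "some": {"some_but_not_all": 1.0, "all": 1.0},
    "all": {"some_but_not_all": 0.0, "all": 1.0},
}
PRIOR: Dict[str, float] = {"some_but_not_all": 0.5, "all": 0.5}


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance: str) -> Dict[str, float]:
    """L0(m | u) is proportional to truth(u, m) * P(m)."""
    return normalize({m: LITERAL[utterance][m] * PRIOR[m] for m in MEANINGS})


def pragmatic_speaker(meaning: str, alpha: float = 1.0) -> Dict[str, float]:
    """S1(u | m) is proportional to L0(m | u) ** alpha (utterance costs omitted)."""
    return normalize(
        {u: literal_listener(u)[meaning] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance: str, alpha: float = 1.0) -> Dict[str, float]:
    """L1(m | u) is proportional to S1(u | m) * P(m)."""
    return normalize(
        {m: pragmatic_speaker(m, alpha)[utterance] * PRIOR[m] for m in MEANINGS}
    )


if __name__ == "__main__":
    print(pragmatic_listener("some"))
```

With these toy settings the pragmatic listener hearing "some" shifts probability toward "some_but_not_all", even though "some" is literally compatible with both meanings.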

Title: Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models

Authors: Yao Xiao, Heidi Christensen, Stefan Goetze
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09315
Pdf URL: https://arxiv.org/pdf/2506.09315
Copy Paste: [[2506.09315]] Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models(https://arxiv.org/abs/2506.09315)
Keywords: language model, llm, prompt
Abstract: Alzheimer's dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts language ability. This work extends the paired perplexity approach to detecting AD by using a recent large language model (LLM), the instruction-following version of Mistral-7B. We improve accuracy by an average of 3.33% over the best current paired perplexity method and by 6.35% over the top-ranked method from the ADReSS 2020 challenge benchmark. Our further analysis demonstrates that the proposed approach can effectively detect AD with a clear and interpretable decision boundary in contrast to other methods that suffer from opaque decision-making processes. Finally, by prompting the fine-tuned LLMs and comparing the model-generated responses to human responses, we illustrate that the LLMs have learned the special language patterns of AD speakers, which opens up possibilities for novel methods of model interpretation and data augmentation.
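The paired perplexity idea can be illustrated with a small sketch. It assumes the usual setup for this family of methods, in which one language model is adapted to transcripts from speakers with AD and another to control speakers, and a transcript is classified by comparing the two perplexities. The abstract does not spell out the scoring rule, so the log-perplexity difference, the zero threshold, and all function names below are assumptions; the per-token log-probabilities would come from the two fine-tuned models.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def paired_perplexity_score(
    logprobs_control_lm: Sequence[float],
    logprobs_ad_lm: Sequence[float],
) -> float:
    """Log-perplexity difference; positive when the AD-adapted model is less surprised."""
    return math.log(perplexity(logprobs_control_lm)) - math.log(
        perplexity(logprobs_ad_lm)
    )


def classify(
    logprobs_control_lm: Sequence[float],
    logprobs_ad_lm: Sequence[float],
    threshold: float = 0.0,  # assumed decision boundary, tuned on held-out data in practice
) -> str:
    """Label a transcript from the two models' per-token log-probabilities."""
    score = paired_perplexity_score(logprobs_control_lm, logprobs_ad_lm)
    return "AD" if score > threshold else "control"


if __name__ == "__main__":
    # Made-up log-probabilities for one transcript under each model.
    print(classify([-2.0, -2.0, -2.0], [-1.0, -1.0, -1.0]))
```

A single scalar score with a threshold is what makes the decision boundary easy to inspect, which matches the interpretability point made in the abstract.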

Title: Towards Efficient and Effective Alignment of Large Language Models

Authors: Yuxin Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09329
Pdf URL: https://arxiv.org/pdf/2506.09329
Copy Paste: [[2506.09329]] Towards Efficient and Effective Alignment of Large Language Models(https://arxiv.org/abs/2506.09329)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs' ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models' constraint adherence, offering insights for future improvements.

Title: Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Authors: Arjun Vaithilingam Sudhakar
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2506.09331
Pdf URL: https://arxiv.org/pdf/2506.09331
Copy Paste: [[2506.09331]] Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation(https://arxiv.org/abs/2506.09331)
Keywords: language model, llm, agent
Abstract: Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others' intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents' ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.

Title: RePO: Replay-Enhanced Policy Optimization

Authors: Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09340
Pdf URL: https://arxiv.org/pdf/2506.09340
Copy Paste: [[2506.09340]] RePO: Replay-Enhanced Policy Optimization(https://arxiv.org/abs/2506.09340)
Keywords: language model, llm, prompt
Abstract: Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at this https URL.
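As a rough illustration of the ingredients named in the abstract, the sketch below shows GRPO-style group-relative advantages (rewards standardized within the group of outputs for one prompt) alongside a small per-prompt replay buffer. The abstract does not give the RePO objective or its replay strategies, so the buffer design, the recency-based retrieval, and the choice of population standard deviation are assumptions made only for this example. Any off-policy correction the actual method applies is not shown.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev
from typing import Deque, Dict, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of sampled outputs."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Keeps the most recent (output, reward) pairs per prompt (hypothetical design)."""

    def __init__(self, capacity_per_prompt: int = 8) -> None:
        self._store: Dict[str, Deque[Tuple[str, float]]] = defaultdict(
            lambda: deque(maxlen=capacity_per_prompt)
        )

    def add(self, prompt: str, output: str, reward: float) -> None:
        self._store[prompt].append((output, reward))

    def retrieve(self, prompt: str, k: int) -> List[Tuple[str, float]]:
        """Recency-based retrieval; only one of many possible replay strategies."""
        return list(self._store[prompt])[-k:] if k > 0 else []


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity_per_prompt=8)
    buffer.add("2+2=?", "5", 0.0)
    buffer.add("2+2=?", "4", 1.0)

    on_policy_rewards = [1.0, 0.0, 0.0, 0.0]
    replayed_rewards = [reward for _, reward in buffer.retrieve("2+2=?", k=2)]
    # Advantages over the enlarged group of on-policy plus replayed samples.
    print(group_relative_advantages(on_policy_rewards + replayed_rewards))
```

Enlarging the group with replayed samples can keep the advantages informative when the fresh on-policy rewards are nearly all identical, which is one plausible reading of the "effective optimization steps" figure in the abstract.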

Title: Latent Multi-Head Attention for Small Language Models

Authors: Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09342
Pdf URL: https://arxiv.org/pdf/2506.09342
Copy Paste: [[2506.09342]] Latent Multi-Head Attention for Small Language Models(https://arxiv.org/abs/2506.09342)
Keywords: language model, gpt
Abstract: We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r = d/2 achieves a 1.4 times speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate perplexity results, with ours achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.

Title: OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment

Authors: Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Jieping Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09349
Pdf URL: https://arxiv.org/pdf/2506.09349
Copy Paste: [[2506.09349]] OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment(https://arxiv.org/abs/2506.09349)
Keywords: language model, llm
Abstract: Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, featuring dual-resolution speech representations and contrastive cross-modal alignment. Our approach processes speech and text representations in parallel while enhancing audio comprehension through contrastive alignment. Experimental results on Spoken Question Answering benchmarks demonstrate that OmniDRCA establishes new state-of-the-art (SOTA) performance among parallel joint speech-text modeling based foundation models, and achieves competitive performance compared to interleaved models. Additionally, we explore the potential of extending the framework to full-duplex conversational scenarios.

Title: DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

Authors: Yuchen Feng, Bowen Shen, Naibin Gu, Jiaxuan Zhao, Peng Fu, Zheng Lin, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09351
Pdf URL: https://arxiv.org/pdf/2506.09351
Copy Paste: [[2506.09351]] DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts(https://arxiv.org/abs/2506.09351)
Keywords: language model, llm
Abstract: Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.

Title: Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL

Authors: Qingyun Zeng, Simin Ma, Arash Niknafs, Ashish Basran, Carol Szabo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09359
Pdf URL: https://arxiv.org/pdf/2506.09359
Copy Paste: [[2506.09359]] Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL(https://arxiv.org/abs/2506.09359)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) has significantly advanced Text-to-SQL (NL2SQL) systems, yet evaluating the semantic equivalence of generated SQL remains a challenge, especially given ambiguous user queries and multiple valid SQL interpretations. This paper explores using LLMs to assess both semantic and a more practical "weak" semantic equivalence. We analyze common patterns of SQL equivalence and inequivalence, and discuss challenges in LLM-based evaluation.
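To make the notion of SQL equivalence concrete, the sketch below compares two differently written queries by executing them on a small in-memory SQLite database and comparing the result multisets. This is a simple execution-match check, not the LLM-based evaluation the paper studies. The table, its rows, and the helper names are invented for illustration. Matching results on one database are only evidence of equivalence, since two queries can agree on one dataset and differ on another.

```python
import sqlite3
from collections import Counter


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
    conn.executemany(
        "INSERT INTO emp (name, dept) VALUES (?, ?)",
        [("Ana", "A"), ("Bo", "B"), ("Cy", "C"), ("Di", "A")],
    )
    conn.commit()
    return conn


def results_match(conn: sqlite3.Connection, sql_a: str, sql_b: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    rows_a = Counter(conn.execute(sql_a).fetchall())
    rows_b = Counter(conn.execute(sql_b).fetchall())
    return rows_a == rows_b


QUERY_OR = "SELECT name FROM emp WHERE dept = 'A' OR dept = 'B'"
QUERY_IN = "SELECT name FROM emp WHERE dept IN ('A', 'B')"
QUERY_NARROW = "SELECT name FROM emp WHERE dept = 'A'"

if __name__ == "__main__":
    db = make_demo_db()
    print(results_match(db, QUERY_OR, QUERY_IN))
    print(results_match(db, QUERY_OR, QUERY_NARROW))
```

Comparing multisets rather than lists ignores row order but still counts duplicate rows, which matters when one query uses `DISTINCT` and the other does not.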

Title: COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content

Authors: Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09367
Pdf URL: https://arxiv.org/pdf/2506.09367
Copy Paste: [[2506.09367]] COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content(https://arxiv.org/abs/2506.09367)
Keywords: llm
Abstract: While Generative AI has demonstrated strong potential and versatility in content generation, its application to educational contexts presents several challenges. Models often fail to align with curriculum standards and maintain grade-appropriate reading levels consistently. Furthermore, STEM education poses additional challenges in balancing scientific explanations with everyday language when introducing complex and abstract ideas and phenomena to younger students. In this work, we propose COGENT, a curriculum-oriented framework for generating grade-appropriate educational content. We incorporate three curriculum components (science concepts, core ideas, and learning objectives), control readability through length, vocabulary, and sentence complexity, and adopt a "wonder-based" approach to increase student engagement and interest. We conduct a multi-dimensional evaluation via both LLM-as-a-judge and human expert analysis. Experimental results show that COGENT consistently produces grade-appropriate passages that are comparable or superior to human references. Our work establishes a viable approach for scaling adaptive and high-quality learning resources.

Title: CoLMbo: Speaker Language Model for Descriptive Profiling

Authors: Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, Bhiksha Raj
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.09375
Pdf URL: https://arxiv.org/pdf/2506.09375
Copy Paste: [[2506.09375]] CoLMbo: Speaker Language Model for Descriptive Profiling(https://arxiv.org/abs/2506.09375)
Keywords: language model, prompt
Abstract: Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.
摘要：说话者识别系统通常仅限于分类任务，并难以产生详细的说话者特征或提供上下文富裕的描述。这些模型主要是为说话者识别提取嵌入，但无法以结构化的方式捕获语言，性别和年龄等人口统计学属性。本文介绍了Colmbo，Colmbo是一种扬声器语言模型（SLM），该模型通过将扬声器编码器与及时的条件集成在一起来解决这些限制。这允许创建基于说话者嵌入的详细标题。 Colmbo利用用户定义的提示动态适应新的扬声器特征，并提供自定义的描述，包括区域方言变化和与年龄相关的特征。这种创新的方法不仅可以增强传统的演讲者分析，而且可以在各种数据集的零拍摄方案中出色，这标志着说话者识别领域的显着进步。

Title: Comparing human and LLM politeness strategies in free production

Authors: Haoran Zhao, Robert D. Hawkins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09391
Pdf URL: https://arxiv.org/pdf/2506.09391
Copy Paste: [[2506.09391]] Comparing human and LLM politeness strategies in free production(https://arxiv.org/abs/2506.09391)
Keywords: language model, llm
Abstract: Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals -- from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models ($\ge$70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.
摘要：礼貌的演讲对大语模型（LLM）提出了基本的一致性挑战。人类部署了丰富的语言策略曲目，以平衡信息和社会目标 - 从建立融洽关系的积极方法（称赞，感兴趣的表达）到负面策略的负面策略，这些策略最大程度地减少了征收（对冲，间接性）。我们研究LLM是否通过比较受约束和开放式生产任务中的人类和LLM响应来采用类似的上下文敏感曲目。我们发现，较大的模型（$ \ ge $ 70B参数）成功地复制了计算实用主义文献中的关键偏好，并且在开放式上下文中，人类评估者令人惊讶地更喜欢LLM生成的响应。但是，进一步的语言分析表明，即使在积极的环境中，模型也不成比例地依赖负面的礼貌策略，这可能导致误解。尽管现代LLM对礼貌策略有了令人印象深刻的处理，但这些微妙的差异引发了有关AI系统中务实一致性的重要问题。

Title: Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models

Authors: Jui-Ming Yao, Hao-Yuan Chen, Zi-Xian Tang, Bing-Jia Tan, Sheng-Wei Peng, Bing-Cheng Xie, Shun-Feng Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09408
Pdf URL: https://arxiv.org/pdf/2506.09408
Copy Paste: [[2506.09408]] Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models(https://arxiv.org/abs/2506.09408)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
摘要：大型语言模型（LLMS）在多项选择问题答案（MCQA）基准上表现出了令人印象深刻的表现，但它们仍然很容易受到较小的输入扰动的影响。在本文中，我们介绍和评估令牌约束解码（TCD）。这种简单而有效的推理时间算法强制执行令牌级别预测之间的一致性，以增强嘈杂设置的鲁棒性。通过对CONSENSENSENSENSENSENESQA，MMLU和MMLU-PRO进行的广泛实验，我们表明TCD，尤其是与及时工程（PE）固定配对时，大大恢复了因输入噪声而降低的性能，从而使高达+39 \％的绝对获得 +％的绝对获得，例如gemma3 1b（例如gemma3 1b）。罚球分析进一步表明，TCD隐含地正规化过度自信的产出，不同的模型需要明显的惩罚时间表以最大化弹性。我们的发现将TCD建立为一种实用的模型不足的方法，用于在现实世界中的缺陷下改善推理稳定性，并为在安全至关重要或面向用户的应用程序中更可靠地部署LLMS铺平道路。
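As a rough illustration of the constrained-decoding idea in the TCD abstract, the Python sketch below subtracts a penalty from the logits of every token outside an allowed answer set before picking the top token. The abstract does not spell out the algorithm, so the helper name `constrained_argmax`, the additive form of the penalty, and the toy logits are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Set


def constrained_argmax(logits: Dict[str, float], allowed: Set[str], penalty: float) -> str:
    """Pick the top token after penalizing tokens outside `allowed`.

    Hypothetical helper: the additive penalty is an assumption,
    not the formula used in the paper.
    """
    adjusted = {
        tok: (score if tok in allowed else score - penalty)
        for tok, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)


# Toy MCQA decoding step: input noise makes a non-answer token score highest.
toy_logits = {"A": 1.2, "B": 2.0, "C": 0.3, "D": 0.1, "the": 2.5}
answer_tokens = {"A", "B", "C", "D"}

# penalty=0.0 leaves the logits unchanged (unconstrained decoding).
unconstrained = constrained_argmax(toy_logits, answer_tokens, penalty=0.0)
# A large enough penalty pushes the prediction back into the answer set.
constrained = constrained_argmax(toy_logits, answer_tokens, penalty=5.0)
print(unconstrained, constrained)
```

Sweeping `penalty` over a range of values in this toy setup is one way to picture the "penalty sweep" mentioned in the abstract: small penalties leave the noisy prediction in place, larger ones force an answer token.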

Title: PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering

Authors: Xiujun Zhou, Pingjian Zhang, Deyou Tang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.09414
Pdf URL: https://arxiv.org/pdf/2506.09414
Copy Paste: [[2506.09414]] PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering(https://arxiv.org/abs/2506.09414)
Keywords: language model, llm, prompt
Abstract: Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models' generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP of 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions of 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.
摘要：知识图应答（KGQA）是自然语言处理中的一项关键任务，需要对知识图（kgs）进行推理才能回答自然语言问题。使用大型语言模型（LLM）的最新方法显示出了出色的语义解析功能，但受到不同注释数据和多跳的推理样本的稀缺的限制。传统的数据增强方法主要集中于单跳问题和容易出现语义失真的问题，而基于LLM的方法主要解决语义失真，但通常忽略了多跳的推理，从而限制了数据多样性。多跳样本的稀缺性进一步削弱了模型的概括。为了解决这些问题，我们提出了PGDA-KGQA，这是一个迅速引入的生成框架，具有多种数据增强策略的KGQA。 PGDA-KGQA的核心采用了统一的及时设计范式：通过制定精心设计的提示来整合提供的文本内容，它利用LLMS来生成大规模（问题，逻辑形式）对模型培训。具体而言，PGDA-KGQA丰富了其培训设置，作者：（1）产生单跳伪问题，以改善问题语义与KG关系的一致性；（2）应用更新的语义问题重写以提高针对语言变化的鲁棒性；（3）采用答案引导的反向路径探索来创建现实的多跳问题。 PGDA-KGQA采用增强生成的语义解析管道，利用增强数据来提高逻辑形式生成的准确性，从而提高答案检索性能。实验表明，在标准KGQA数据集上胜过最先进的方法，将WebQSP上的改进提高了2.8％，1.2％和3.1％，并且在F1，lits@1和准确性中分别在F1，lits@1和Comperial中提高了1.8％，1.1％，1.1％和2.4％。

Title: Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings

Authors: Md Messal Monem Miah, Adrita Anika, Xi Shi, Ruihong Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09424
Pdf URL: https://arxiv.org/pdf/2506.09424
Copy Paste: [[2506.09424]] Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings(https://arxiv.org/abs/2506.09424)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and commercial LLMs on three distinct datasets: real life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and examine the effectiveness of different prompting strategies, including direct label generation and chain-of-thought reasoning. Our findings provide key insights into how LLMs process and interpret deceptive cues across modalities, highlighting their potential and limitations in real-world deception detection applications.
摘要：在越来越多的数字世界中检测欺骗既是一项艰巨且具有挑战性的任务。在这项研究中，我们对大型语言模型（LLM）和大型多模型（LMM）的自动欺骗检测功能进行了全面评估。我们在三个不同的数据集上评估了开源和商业LLM的性能：现实生活试验访谈（RLTD），人际场景中的欺骗（MU3D）和欺骗性评论（OPSPAM）。我们系统地分析了不同实验设置对欺骗检测的有效性，包括零射击和几种基于随机或基于相似性的内在示例选择的方法。我们的结果表明，微调的LLMS在文本欺骗检测任务上实现了最先进的表现，而LMM则难以充分利用跨模式提示。此外，我们分析了辅助特征（例如非语言手势和视频摘要）的影响，并研究不同提示策略的有效性，包括直接标签生成和经过想象的推理。我们的发现提供了有关LLMS如何处理和解释跨模式的欺骗性提示的关键见解，从而突出了它们在现实世界欺骗检测应用中的潜力和局限性。

Title: Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting

Authors: Fei Ding, Baiqiao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09428
Pdf URL: https://arxiv.org/pdf/2506.09428
Copy Paste: [[2506.09428]] Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting(https://arxiv.org/abs/2506.09428)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT), while enhancing the instruction-following capabilities and domain-specific task adaptability of large language models (LLMs), often diminishes their general capabilities. Moreover, due to the inaccessibility of original pre-training data, catastrophic forgetting tends to be exacerbated when third-party practitioners implement SFT on open-sourced models. To address this challenge, we propose a novel, more cost-effective SFT method which could effectively reduce the risk of catastrophic forgetting without access to original SFT data. Our approach begins by reconstructing the likely SFT instruction distribution of the base model, followed by a multi-model screening process to select optimal data, which is then mixed with new data for SFT. Experimental results demonstrate that our method preserves generalization capabilities in general domains while improving task-specific performance.
摘要：监督微调（SFT），同时增强了大型语言模型（LLMS）的“指导性功能和特定领域的任务适应性”，通常会降低其一般能力。此外，由于原始预培训数据的无法访问，当第三方从业人员对开源模型实施SFT时，灾难性遗忘往往会加剧。为了应对这一挑战，我们提出了一种新颖，更具成本效益的SFT方法，该方法可以有效地降低灾难性遗忘的风险，而无需访问原始的SFT数据。我们的方法首先重建基本模型的SFT指令分布，然后是多模型筛选过程以选择最佳数据，然后将其与SFT的新数据混合。实验结果表明，我们的方法可以保留一般领域中的概括能力，同时提高特定于任务的性能。

Title: GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture

Authors: GigaChat team: Mamedov Valentin, Evgenii Kosarev, Gregory Leleytner, Ilya Shchuckin, Valeriy Berezovskiy, Daniil Smirnov, Dmitry Kozlov, Sergei Averkiev, Lukyanenko Ivan, Aleksandr Proshunin, Ainur Israfilova, Ivan Baskov, Artem Chervyakov, Emil Shakirov, Mikhail Kolesov, Daria Khomich, Darya Latortseva, Sergei Porkhun, Yury Fedorov, Oleg Kutuzov, Polina Kudriavtseva, Sofiia Soldatova, Kolodin Egor, Stanislav Pyatkin, Dzmitry Menshykh, Grafov Sergei, Eldar Damirov, Karlov Vladimir, Ruslan Gaitukiev, Arkadiy Shatenov, Alena Fenogenova, Nikita Savushkin, Fedor Minkin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09440
Pdf URL: https://arxiv.org/pdf/2506.09440
Copy Paste: [[2506.09440]] GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture(https://arxiv.org/abs/2506.09440)
Keywords: language model, llm, chat
Abstract: Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three open GigaChat models in open-source (this https URL), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.
摘要：生成的大语言模型（LLM）对于现代NLP研究和各种语言的应用至关重要。但是，专门针对俄罗斯语言量身定制的基础模型的发展受到限制，这主要是由于所需的大量计算资源。本文介绍了俄罗斯LLM的Gigachat家族，包括各种尺寸，包括基本型号和教学调整版本。我们提供有关模型体系结构，预训练过程和实验的详细报告，以指导设计选择。此外，我们评估了它们在俄罗斯和英语基准测试中的表现，并将Gigachat与多语言类似物进行了比较。本文提供了可以通过API，电报机器人和Web界面访问的最佳模型的系统演示。此外，我们在开源（此HTTPS URL）中发布了三种开放的Gigachat模型，旨在扩大NLP研究机会并支持俄罗斯语言工业解决方案的开发。

Title: UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs

Authors: Prameshwar Thiyagarajan, Vaishnavi Parimi, Shamant Sai, Soumil Garg, Zhangir Meirbek, Nitin Yarlagadda, Kevin Zhu, Chris Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09450
Pdf URL: https://arxiv.org/pdf/2506.09450
Copy Paste: [[2506.09450]] UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs(https://arxiv.org/abs/2506.09450)
Keywords: language model, gpt, llm
Abstract: Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: this https URL.
摘要：心理理论（汤姆）是理解自己和他人心理状态的能力，仍然是大型语言模型（LLMS）的具有挑战性的领域，通常无法准确预测人类的精神状态。在本文中，我们介绍了Unitombench，这是一个统一的基准测试，该基准通过整合多相互作用的任务设计和不断发展的故事情景，整合了Simtom和Tombench的优势，以系统地改善和评估LLMS中的TOM功能。在一个自定义的数据集的支持下，Unitombench将观点的技术与多样化的评估指标结合在一起，以更好地刺激LLMS的社交认知。通过评估，我们观察到，尽管诸如GPT-4O和GPT-4O MINI之类的模型在涉及情绪和与信念相关的方案的任务中表现出一致的准确性，结果通常超过80％，但其跨知识任务的绩效却有很大的差异。这些结果突出了与TOM相关的任务中当前LLM的优势和局限性，强调了Unitombench作为未来开发的全面工具的价值。我们的代码在这里公开可用：此HTTPS URL。

Title: Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Authors: Zeguan Xiao, Yun Chen, Guanhua Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09457
Pdf URL: https://arxiv.org/pdf/2506.09457
Copy Paste: [[2506.09457]] Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms(https://arxiv.org/abs/2506.09457)
Keywords: language model, llm
Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. When training with POET, both responses in each sample are truncated to equal length, which results in diverse truncated lengths across samples; the optimization of the DAA objective is thus implicitly constrained to converge across all positions, paying more attention to prefix tokens than standard DAAs do. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
摘要：直接比对算法（DAAS），例如直接偏好优化（DPO）和简单的偏好优化（SIMPO），已成为从人类反馈（RLHF）算法增强算法的有效替代方法，以使大语言模型（LLMS）与人类偏好相结合。但是，DAA受到了我们确定为“奖励产生差距”的基本限制，这是训练期间优化目标与推理过程中实际发电绩效之间的错位。在本文中，我们发现奖励产生差距的贡献是前缀代币在LLM生成过程中的固有重要性之间的不匹配，以及如何在DAA的隐式奖励函数中反映出这种重要性。为了弥合差距，我们引入了一种简单而有效的方法，称为前缀面向前缀的相等长度训练（诗人），该方法截断了首选和分配的响应，以匹配较短的长度。诗人的训练，每个样本中的两个响应都被截断至相等的长度，导致样本之间的截短长度不同，DAAS物镜的优化被隐式约束以在所有位置上收敛，从而比标准DAA更加关注前缀。我们使用两位代表性DAA的DPO和Simpo进行实验，表明诗人对其标准实现有所改善，在Alpacaeval 2中达到了15.6分，并且在下游任务中进行了整体改进。我们的结果强调了解决DAA中奖励优化和发电性能之间未对准的重要性。
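The truncation step described in the POET abstract can be sketched in a few lines of Python. This is a minimal sketch that assumes responses are already tokenized into lists of token ids; the function name `truncate_pair` and the toy ids are hypothetical, and the actual DPO or SimPO loss computation is not shown.

```python
from typing import List, Tuple


def truncate_pair(chosen: List[int], rejected: List[int]) -> Tuple[List[int], List[int]]:
    """Cut both responses to the length of the shorter one, keeping the prefixes."""
    n = min(len(chosen), len(rejected))
    return chosen[:n], rejected[:n]


# Toy preference pairs (token ids are made up for illustration).
pairs = [
    ([11, 12, 13, 14, 15], [21, 22, 23]),  # preferred response is longer
    ([31, 32], [41, 42, 43, 44]),          # dispreferred response is longer
]

truncated = [truncate_pair(chosen, rejected) for chosen, rejected in pairs]
lengths = [len(chosen) for chosen, _ in truncated]
print(truncated, lengths)
```

Because the shorter response differs from pair to pair, the truncated length varies across samples, which is the property the abstract credits with spreading the training signal over all prefix positions.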

Title: Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers

Authors: Ilanit Sobol, Shir Lissak, Refael Tikochinski, Tal Nakash, Anat Brunstein Klomek, Eyal Fruchter, Roi Reichart
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09495
Pdf URL: https://arxiv.org/pdf/2506.09495
Copy Paste: [[2506.09495]] Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers(https://arxiv.org/abs/2506.09495)
Keywords: llm
Abstract: Suicide remains a leading cause of death in Western countries, underscoring the need for new research approaches. As social media becomes central to daily life, digital footprints offer valuable insight into suicidal behavior. Focusing on individuals who attempted suicide while uploading videos to their channels, we investigate: How do suicidal behaviors manifest on YouTube, and how do they differ from expert knowledge? We applied complementary approaches: computational bottom-up, hybrid, and expert-driven top-down, on a novel longitudinal dataset of 181 YouTube channels from individuals with life-threatening attempts, alongside 134 control channels. In the bottom-up approach, we applied LLM-based topic modeling to identify behavioral indicators. Of 166 topics, five were associated with suicide attempts, with two also showing temporal attempt-related changes ($p<.01$) - Mental Health Struggles ($+0.08$)* and YouTube Engagement ($+0.1$)*. In the hybrid approach, a clinical expert reviewed LLM-derived topics and flagged 19 as suicide-related. However, none showed significant attempt-related temporal effects beyond those identified bottom-up. Notably, YouTube Engagement, a platform-specific indicator, was not flagged by the expert, underscoring the value of bottom-up discovery. In the top-down approach, psychological assessment of suicide attempt narratives revealed that the only significant difference between individuals who attempted before and those who attempted during their upload period was the motivation to share this experience: the former aimed to Help Others ($\beta=-1.69$, $p<.01$), while the latter framed it as part of their Personal Recovery ($\beta=1.08$, $p<.01$). By integrating these approaches, we offer a nuanced understanding of suicidality, bridging digital behavior and clinical insights. * Within-group changes in relation to the suicide attempt.
摘要：自杀仍然是西方国家死亡的主要原因，强调了对新研究方法的需求。随着社交媒体成为日常生活的核心，数字足迹为自杀行为提供了宝贵的见解。我们专注于在将视频上传到渠道时自杀的个人，我们调查：自杀行为如何在YouTube上表现出来，以及他们与专家知识有何不同？我们采用互补方法：在181个YouTube频道的新型纵向数据集中，来自具有威胁生命的尝试的人以及134个控制渠道的个人的181个YouTube渠道的新型纵向数据集中。在自下而上的方法中，我们应用了基于LLM的主题建模来识别行为指标。在166个主题中，有5个与自杀相关，其中两个也显示出与时间相关的变化（$ p <.01 $） - 心理健康斗争（$+0.08 $）*和YouTube参与度（$+0.1 $）*。在混合方法中，一名临床专家回顾了LLM衍生的主题，并将19个标记为自杀相关。但是，除了确定的自下而上，没有一个显示出显着的与尝试相关的时间影响。值得注意的是，专家没有标记YouTube参与度，这是一个特定于平台的指标，强调了自下而上发现的价值。 In the top-down approach, psychological assessment of suicide attempt narratives revealed that the only significant difference between individuals who attempted before and those attempted during their upload period was the motivation to share this experience: the former aimed to Help Others ($\beta=-1.69$, $p<.01$), while the latter framed it as part of their Personal Recovery ($\beta=1.08$, $p<.01$).通过整合这些方法，我们对自杀性，弥合数字行为和临床见解有细微的理解。 *组内改变与自杀企图有关。

Title: Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

Authors: Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09501
Pdf URL: https://arxiv.org/pdf/2506.09501
Copy Paste: [[2506.09501]] Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning(https://arxiv.org/abs/2506.09501)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at this https URL.
摘要：大型语言模型（LLM）现在是各个领域不可或缺的，并且表现出令人印象深刻的性能。然而，进步基于这样的前提：基准分数既准确又可重复。我们证明了LLM性能的可重复性是脆弱的：更改系统配置，例如评估批次大小，GPU计数和GPU版本可能会在生成的响应中引入显着差异。在推理模型中，这个问题尤其明显，在推理模型中，早期令牌的小圆形差异可以层叠成不同思想链，最终影响准确性。例如，在Bfloat16的精度下，贪婪解码，诸如DeepSeek-R1-Distill-Qwen-7b之类的推理模型可以表现出高达9％的准确性变化，并且由于GPU计数，类型和评估批次尺寸的差异，响应长度的响应长度差异为9,000个令牌差异。我们将这种可变性的根本原因追溯到有限的数值精度下的浮点算术的非缔合性质。这项工作提出了第一个系统的研究，以了解数值精度如何影响LLM推断的可重复性。通过仔细控制的各种硬件，软件和精度设置的实验，我们量化了模型输出何时以及如何差异。我们的分析表明，在评估实践中通常会忽略浮点精度 - 虽然对可重复性至关重要。受此启发的启发，我们开发了一种称为LayerCast的轻量级推理管道，该管道将权重存储在16位精度中，但在FP32中执行所有计算，平衡记忆效率与数值稳定性。代码可在此HTTPS URL上找到。
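The non-associativity the abstract points to can be reproduced with the Python standard library alone. The sketch below emulates bfloat16 by keeping only the top 16 bits of a float32 encoding; the helpers `to_bf16` and `bf16_add` are hypothetical names, and truncation is a simplifying assumption (real hardware typically rounds to nearest even). It shows only the arithmetic effect, not the LayerCast pipeline itself.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding.

    Assumption: truncation instead of round-to-nearest-even, for simplicity.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def bf16_add(a: float, b: float) -> float:
    """Add two numbers and round the result back to emulated bfloat16."""
    return to_bf16(to_bf16(a) + to_bf16(b))


big, small = 1.0, 2.0 ** -8

# Same three numbers, two summation orders, as different reduction
# schedules (batch size, GPU count) might produce.
left = bf16_add(bf16_add(big, small), small)   # (1 + s) + s
right = bf16_add(big, bf16_add(small, small))  # 1 + (s + s)

# Accumulating in higher precision gives the same answer in both orders here.
left_hi = (big + small) + small
right_hi = big + (small + small)

print(left, right, left_hi, right_hi)
```

In the low-precision path each small addend is lost when added to 1.0 on its own, but survives when the two small values are combined first; the higher-precision path is insensitive to the order for these inputs, which is the intuition behind computing in FP32 while storing in 16 bits.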

Title: TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Authors: Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09507
Pdf URL: https://arxiv.org/pdf/2506.09507
Copy Paste: [[2506.09507]] TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding(https://arxiv.org/abs/2506.09507)
Keywords: language model
Abstract: Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.
摘要：变形金刚在捕获长期依赖性方面表现出熟练程度，而状态空间模型（SSM）有助于线性时间序列建模。尽管它们具有协同的潜力，但这些架构的整合带来了重大挑战，主要归因于各自位置编码机制的根本不一致：变形金刚依赖于显式旋转位置嵌入（ROPE），而SSMS则利用隐含的位置表示通过卷积。这种差异通常会导致不连续性和次优性能。为了解决这一障碍，我们提出了一个统一的旋转位置嵌入（\ textbf {\ urrope}）方法，从而为自我注意力和状态空间组件建立了一个一致的位置编码框架。使用此\ oureRope，我们介绍了\ textbf {\ model}，这是一种混合体系结构，在此统一的位置编码方案下相干地集成了变压器和SSM层。在4K序列长度上，\模型分别表现为\ textBf {42.3 \％和29.5 \％faster}的训练速度和推理速度，相对于标准变压器模型。它还提供了更高的精度：在可比的设置下，它在语言建模基准测试基准上超过4 \％。 \ Model此外，更有效地缩放：\ Model-1.3b获得\ TextBf {7.22 \％}的平均精度比其320m版本（与等效变压器或SSMS的大约6 \％增益相比）。我们的结果表明，统一的位置编码可以解决混合模型中的位置不相容性，从而实现了有效的，高性能的长篇下说建模。
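For readers unfamiliar with rotary position embeddings, the following Python sketch implements the standard RoPE rotation and checks its key property: the dot product between a rotated query and key depends only on their relative offset. It illustrates plain RoPE as commonly defined, not the paper's unified scheme; how the rotation is applied inside the SSM layers is not described in the abstract, and the function names `rope` and `dot` are placeholders.

```python
import math
from typing import List


def rope(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions have the same offset (2), so the scores should match.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
print(score_near, score_far)
```

Since the score depends only on the position offset, the same rotation can in principle be shared by any component that compares or mixes position-tagged vectors, which is the kind of consistency the abstract attributes to a unified encoding.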

Title: ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2506.09513
Pdf URL: https://arxiv.org/pdf/2506.09513
Copy Paste: [[2506.09513]] ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning(https://arxiv.org/abs/2506.09513)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
摘要：尽管基于推理的大语言模型（LLM）在数学和编程方面表现出色，但它们在知识密集型医学问答中的功能仍未得到充实。为了解决这个问题，我们介绍了最大的医学推理数据集原因，其中包括370k高质量的示例，这些示例是从各种LLMS产生的170万个初始推理路径中提取的。理性是通过\ textIt {多代理验证和完善过程}构建的，在该过程中，我们设计了\ textit {error Refiner}，以通过识别和纠正验证程序标记的易于错误的步骤来增强推理路径。利用理由，我们系统地研究了培训医学推理模型的最佳实践，并发现将详细的思想链（COT）推理与简洁的答案摘要结合起来，产生了最有效的微调策略。基于此策略，我们训练Reason Med-7B，该策略为低于10B模型设定了新的基准，在PubMedQA上的最佳优于4.17 \％，甚至超过Llama3.1-70B，占4.60 \％。

Title: KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs

Authors: Dingjun Wu, Yukun Yan, Zhenghao Liu, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09542
Pdf URL: https://arxiv.org/pdf/2506.09542
Copy Paste: [[2506.09542]] KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs(https://arxiv.org/abs/2506.09542)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates KGs into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.
摘要：检索增强的生成（RAG）通过基础外部知识的响应来提高事实准确性。但是，现有方法通常依赖于单个来源，即非结构化的文本或结构化知识。此外，它们缺乏激活相关知识的认知灵感机制。为了解决这些问题，我们提出了Insuned KG的抹布，该框架将KG集成到抹布系统中以实现扩散激活，这是一个能够构成概念关联和推理的认知过程。注入KG的RAG检索KG事实，相应地扩展了查询，并通过将语料库段落与结构化事实相结合，从而增强了生成，从而实现了以语义结构为基础的可解释的多源检索。我们通过偏好学习管道中的采样关键阶段，进一步改善了注入KG的抹布。五个质量检查基准的实验表明，注入KG的破布始终优于香草抹布（3.8％至13.8％）。此外，当集成到自lag中时，INSUD INSUNE的抹布将带来进一步的性能增长，证明了其有效性和多功能性，作为基于语料库的抹布方法的插件增强模块。
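The retrieve-then-expand flow in the KG-Infused RAG abstract can be pictured with a small Python sketch of spreading activation over a toy knowledge graph. Everything here is an illustrative assumption: the triples, the hop limit, the function names `spread_activation` and `expand_query`, and the plain string concatenation used for query expansion; the paper's actual retrieval, LLM-based expansion, and preference learning are not modeled.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def spread_activation(kg: List[Triple], seeds: Set[str], hops: int) -> List[Triple]:
    """Collect facts reachable from the seed entities within `hops` outgoing steps."""
    activated = set(seeds)
    frontier = set(seeds)
    facts: List[Triple] = []
    for _ in range(hops):
        next_frontier: Set[str] = set()
        for head, rel, tail in kg:
            if head in frontier and (head, rel, tail) not in facts:
                facts.append((head, rel, tail))
                if tail not in activated:
                    next_frontier.add(tail)
        activated |= next_frontier
        frontier = next_frontier
    return facts


def expand_query(query: str, facts: List[Triple]) -> str:
    """Append the activated facts to the query as plain text."""
    fact_text = "; ".join(f"{h} {r} {t}" for h, r, t in facts)
    return f"{query} [facts: {fact_text}]"


toy_kg: List[Triple] = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Poland", "currency", "zloty"),
    ("Paris", "capital_of", "France"),
]

facts = spread_activation(toy_kg, seeds={"Marie Curie"}, hops=2)
expanded = expand_query("Which country was Marie Curie born in?", facts)
print(expanded)
```

In a fuller pipeline the expanded query would be sent to the corpus retriever, and the retrieved passages would be combined with the activated facts at generation time.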

Title: Gender Bias in English-to-Greek Machine Translation

Authors: Eleni Gkovedarou, Joke Daems, Luna De Bruyne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09558
Pdf URL: https://arxiv.org/pdf/2506.09558
Copy Paste: [[2506.09558]] Gender Bias in English-to-Greek Machine Translation(https://arxiv.org/abs/2506.09558)
Keywords: gpt, prompt
Abstract: As the demand for inclusive language increases, concern has grown over the susceptibility of machine translation (MT) systems to reinforce gender stereotypes. This study investigates gender bias in two commercial MT systems, Google Translate and DeepL, focusing on the understudied English-to-Greek language pair. We address three aspects of gender bias: i) male bias, ii) occupational stereotyping, and iii) errors in anti-stereotypical translations. Additionally, we explore the potential of prompted GPT-4o as a bias mitigation tool that provides both gender-explicit and gender-neutral alternatives when necessary. To achieve this, we introduce GendEL, a manually crafted bilingual dataset of 240 gender-ambiguous and unambiguous sentences that feature stereotypical occupational nouns and adjectives. We find persistent gender bias in translations by both MT systems; while they perform well in cases where gender is explicitly defined, with DeepL outperforming both Google Translate and GPT-4o in feminine gender-unambiguous sentences, they are far from producing gender-inclusive or neutral translations when the gender is unspecified. GPT-4o shows promise, generating appropriate gendered and neutral alternatives for most ambiguous cases, though residual biases remain evident.

Title: Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language

Authors: Stefan Krsteski, Matea Tashkovska, Borjan Sazdov, Hristijan Gjoreski, Branislav Gerazov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09560
Pdf URL: https://arxiv.org/pdf/2506.09560
Copy Paste: [[2506.09560]] Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language(https://arxiv.org/abs/2506.09560)
Keywords: language model, llm
Abstract: The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at this http URL for source code, and at this http URL for pretrained model weights and data.

Title: From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies

Authors: Blaž Škrlj, Boshko Koloski, Senja Pollak, Nada Lavrač
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09566
Pdf URL: https://arxiv.org/pdf/2506.09566
Copy Paste: [[2506.09566]] From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies(https://arxiv.org/abs/2506.09566)
Keywords: language model, llm, hallucination
Abstract: Integrating structured knowledge from Knowledge Graphs (KGs) into Large Language Models (LLMs) enhances factual grounding and reasoning capabilities. This survey paper systematically examines the synergy between KGs and LLMs, categorizing existing approaches into two main groups: KG-enhanced LLMs, which improve reasoning, reduce hallucinations, and enable complex question answering; and LLM-augmented KGs, which facilitate KG construction, completion, and querying. Through comprehensive analysis, we identify critical gaps and highlight the mutual benefits of structured knowledge integration. Compared to existing surveys, our study uniquely emphasizes scalability, computational efficiency, and data quality. Finally, we propose future research directions, including neuro-symbolic integration, dynamic KG updating, data reliability, and ethical considerations, paving the way for intelligent systems capable of managing more complex real-world knowledge tasks.

Title: Memorization in Language Models through the Lens of Intrinsic Dimension

Authors: Stefan Arnold
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09591
Pdf URL: https://arxiv.org/pdf/2506.09591
Copy Paste: [[2506.09591]] Memorization in Language Models through the Lens of Intrinsic Dimension(https://arxiv.org/abs/2506.09591)
Keywords: language model
Abstract: Language Models (LMs) are prone to memorizing parts of their data during training and unintentionally emitting them at generation time, raising concerns about privacy leakage and disclosure of intellectual property. While previous research has identified properties such as context length, parameter size, and duplication frequency, as key drivers of unintended memorization, little is known about how the latent structure modulates this rate of memorization. We investigate the role of Intrinsic Dimension (ID), a geometric proxy for the structural complexity of a sequence in latent space, in modulating memorization. Our findings suggest that ID acts as a suppressive signal for memorization: compared to low-ID sequences, high-ID sequences are less likely to be memorized, particularly in overparameterized models and under sparse exposure. These findings highlight the interaction between scale, exposure, and complexity in shaping memorization.

Title: Benchmarking Debiasing Methods for LLM-based Parameter Estimates

Authors: Nicolas Audinet de Pieuchon, Adel Daoud, Connor T. Jerzak, Moa Johansson, Richard Johansson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09627
Pdf URL: https://arxiv.org/pdf/2506.09627
Copy Paste: [[2506.09627]] Benchmarking Debiasing Methods for LLM-based Parameter Estimates(https://arxiv.org/abs/2506.09627)
Keywords: language model, llm
Abstract: Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method's performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
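The idea behind Prediction-Powered Inference can be illustrated for the simplest target, a population mean. The sketch below is a minimal illustration under assumed inputs, not the benchmark setup of the paper: `ppi_mean` and its argument names are hypothetical, and it uses only the basic PPI form (the mean of the LLM annotations on the unlabeled pool, shifted by a correction estimated on the expert-labeled subset).

```python
from statistics import fmean


def ppi_mean(llm_unlabeled, llm_labeled, expert_labeled):
    """Basic prediction-powered estimate of a population mean.

    llm_unlabeled  -- LLM annotations on items without expert labels
    llm_labeled    -- LLM annotations on the expert-annotated items
    expert_labeled -- expert annotations on those same items
    """
    if len(llm_labeled) != len(expert_labeled):
        raise ValueError("labeled LLM and expert annotations must be paired")
    if not llm_unlabeled or not expert_labeled:
        raise ValueError("both samples must be non-empty")
    # Average LLM error on the expert subset (the "rectifier").
    rectifier = fmean([e - p for e, p in zip(expert_labeled, llm_labeled)])
    # Cheap LLM-only estimate, shifted by the estimated bias.
    return fmean(llm_unlabeled) + rectifier


# Toy data (made up): the LLM over-predicts the positive class.
estimate = ppi_mean(
    llm_unlabeled=[1, 1, 0, 1],
    llm_labeled=[1, 1],
    expert_labeled=[1, 0],
)
print(estimate)
```

With very few expert labels the rectifier itself is noisy, which is one way to read the finite-sample concerns raised in the abstract.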

Title: Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

Authors: Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09645
Pdf URL: https://arxiv.org/pdf/2506.09645
Copy Paste: [[2506.09645]] Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering(https://arxiv.org/abs/2506.09645)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by 2.66% to 20.34%, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: this https URL.

Title: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA

Authors: Nikolas Evkarpidi, Elena Tutubalina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09657
Pdf URL: https://arxiv.org/pdf/2506.09657
Copy Paste: [[2506.09657]] Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA(https://arxiv.org/abs/2506.09657)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables. The code is available in a GitHub repository.

Title: Query-Level Uncertainty in Large Language Models

Authors: Lihu Chen, Gaël Varoquaux
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09669
Pdf URL: https://arxiv.org/pdf/2506.09669
Copy Paste: [[2506.09669]] Query-Level Uncertainty in Large Language Models(https://arxiv.org/abs/2506.09669)
Keywords: language model
Abstract: It is important for Large Language Models to be aware of the boundary of their knowledge, the mechanism of identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine if the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, which is able to reduce inference costs while maintaining performance.

Title: Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data

Authors: Hao Xiong, Chuanyuan Tan, Wenliang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09672
Pdf URL: https://arxiv.org/pdf/2506.09672
Copy Paste: [[2506.09672]] Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data(https://arxiv.org/abs/2506.09672)
Keywords: language model, llm
Abstract: Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues exist: (1) Lack of Locality evaluation for UKE, and (2) Abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%.

Title: Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

Authors: Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09684
Pdf URL: https://arxiv.org/pdf/2506.09684
Copy Paste: [[2506.09684]] Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models(https://arxiv.org/abs/2506.09684)
Keywords: language model, llm
Abstract: Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at this https URL.

Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Authors: Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2506.09790
Pdf URL: https://arxiv.org/pdf/2506.09790
Copy Paste: [[2506.09790]] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation(https://arxiv.org/abs/2506.09790)
Keywords: gpt, chain-of-thought
Abstract: AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.

Title: Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Authors: Andreas Säuberli, Diego Frassinelli, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09796
Pdf URL: https://arxiv.org/pdf/2506.09796
Copy Paste: [[2506.09796]] Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?(https://arxiv.org/abs/2506.09796)
Keywords: language model, llm
Abstract: Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
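Temperature scaling, mentioned above as the calibration step, divides the answer-option logits by a temperature before the softmax. The following is a minimal sketch with made-up logits; `softmax_with_temperature` is a hypothetical helper, and how the paper actually chooses the temperature is not stated in the abstract.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert answer-option logits into probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-option multiple-choice item.
logits = [4.0, 1.0, 0.5, 0.0]
overconfident = softmax_with_temperature(logits, temperature=1.0)
softened = softmax_with_temperature(logits, temperature=2.0)
print(max(overconfident), max(softened))
```

A temperature above 1 flattens the distribution, which is the direction needed when a model is more confident than human test takers; a temperature below 1 sharpens it.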

Title: CoRT: Code-integrated Reasoning within Thinking

Authors: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09820
Pdf URL: https://arxiv.org/pdf/2506.09820
Copy Paste: [[2506.09820]] CoRT: Code-integrated Reasoning within Thinking(https://arxiv.org/abs/2506.09820)
Keywords: language model, chain-of-thought
Abstract: Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at this https URL.

Title: Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

Authors: Tomas Peterka, Matyas Bohacek
Subjects: cs.CL, cs.AI, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2506.09847
Pdf URL: https://arxiv.org/pdf/2506.09847
Copy Paste: [[2506.09847]] Dataset of News Articles with Provenance Metadata for Media Relevance Assessment(https://arxiv.org/abs/2506.09847)
Keywords: language model, llm
Abstract: Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. The existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR lags behind, leaving room for specialized architectures and future work.

Title: Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Authors: Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
Subjects: cs.CL, cs.AI, math.ST, stat.ME
Abstract URL: https://arxiv.org/abs/2506.09853
Pdf URL: https://arxiv.org/pdf/2506.09853
Copy Paste: [[2506.09853]] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning(https://arxiv.org/abs/2506.09853)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
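The probabilities of necessity and sufficiency referred to above have simple closed forms in the textbook case. The sketch below assumes exogeneity and monotonicity (standard identification conditions from the causal inference literature, not something the abstract confirms the paper relies on), and reads x as "the reasoning step is kept" and y as "the final answer is correct". The function name and the example rates are hypothetical.

```python
def causal_probabilities(p_y_given_x, p_y_given_not_x):
    """Return (PN, PS, PNS) under exogeneity and monotonicity.

    p_y_given_x     -- P(correct answer | step kept)
    p_y_given_not_x -- P(correct answer | step removed)
    """
    if not 0.0 <= p_y_given_not_x <= p_y_given_x <= 1.0:
        raise ValueError("monotonicity requires 0 <= P(y|x') <= P(y|x) <= 1")
    # Probability of necessity and sufficiency: PNS = P(y|x) - P(y|x').
    pns = p_y_given_x - p_y_given_not_x
    # Probability of necessity: PN = PNS / P(y|x).
    pn = pns / p_y_given_x if p_y_given_x > 0 else 0.0
    # Probability of sufficiency: PS = PNS / (1 - P(y|x')).
    ps = pns / (1.0 - p_y_given_not_x) if p_y_given_not_x < 1 else 0.0
    return pn, ps, pns


# Hypothetical intervention results: accuracy 0.9 with the step, 0.3 without.
pn, ps, pns = causal_probabilities(0.9, 0.3)
print(round(pn, 3), round(ps, 3), round(pns, 3))
```

Under this reading, a step with a PN near zero is a candidate for pruning, while a low PS for the whole chain suggests that steps are missing.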

Title: Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs

Authors: Rodion Oblovatny, Alexandra Bazarova, Alexey Zaytsev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09886
Pdf URL: https://arxiv.org/pdf/2506.09886
Copy Paste: [[2506.09886]] Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs(https://arxiv.org/abs/2506.09886)
Keywords: language model, llm, hallucination, prompt
Abstract: We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
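One way to picture a distributional distance between prompt and response hidden states is the maximum mean discrepancy (MMD). This is an assumption made for illustration only: the abstract does not name the distance, and the paper uses trainable deep kernels, whereas this sketch uses a fixed RBF kernel on made-up two-dimensional vectors. All function names are hypothetical.

```python
import math


def rbf_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""

    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for per-token hidden states (real ones are high-dimensional).
prompt_states = [[0.0, 0.0], [1.0, 0.0]]
rephrasing_response = [[0.1, 0.0], [1.1, 0.0]]   # stays close to the prompt
divergent_response = [[3.0, 3.0], [4.0, 3.0]]    # moves away from the prompt

print(mmd_squared(prompt_states, rephrasing_response))
print(mmd_squared(prompt_states, divergent_response))
```

Following the finding reported in the abstract, the smaller of the two distances would be the more suspicious one, since hallucinated responses are described as deviating less from their prompts.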

Title: The Emergence of Abstract Thought in Large Language Models Beyond Any Language

Authors: Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09890
Pdf URL: https://arxiv.org/pdf/2506.09890
Copy Paste: [[2506.09890]] The Emergence of Abstract Thought in Large Language Models Beyond Any Language(https://arxiv.org/abs/2506.09890)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may "think" in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space: a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model's ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons (those that are consistently activated during the processing of particular languages) and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs' language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.

Title: PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

Authors: Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09902
Pdf URL: https://arxiv.org/pdf/2506.09902
Copy Paste: [[2506.09902]] PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants(https://arxiv.org/abs/2506.09902)
Keywords: language model, llm, chat, agent
Abstract: Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.

Title: Aspect-Based Opinion Summarization with Argumentation Schemes

Authors: Wendi Zhou, Ameer Saadat-Yazd, Nadin Kokciyan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09917
Pdf URL: https://arxiv.org/pdf/2506.09917
Copy Paste: [[2506.09917]] Aspect-Based Opinion Summarization with Argumentation Schemes(https://arxiv.org/abs/2506.09917)
Keywords: prompt
Abstract: Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to read through the vast number of reviews and manually distill the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.

Title: VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09942
Pdf URL: https://arxiv.org/pdf/2506.09942
Copy Paste: [[2506.09942]] VerIF: Verification Engineering for Reinforcement Learning in Instruction Following(https://arxiv.org/abs/2506.09942)
Keywords: language model, llm
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at this https URL.
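Note: the abstract above combines rule-based code verification with LLM-based verification. The sketch below shows one way such a hybrid reward could be wired together. How VerIF actually combines the two signals is not stated in the abstract, so the equal-weight average, the function name `hybrid_reward`, and the `llm_judge` callable (a stand-in for a call to a reasoning model) are all assumptions made for illustration.

```python
from typing import Callable, List

def hybrid_reward(response: str,
                  rule_checks: List[Callable[[str], bool]],
                  llm_judge: Callable[[str], float]) -> float:
    """Combine code-checkable constraints with a judge score in [0, 1].

    rule_checks: functions that each verify one hard constraint in code.
    llm_judge:   hypothetical callable returning a soft-constraint score.
    """
    if rule_checks:
        rule_score = sum(1 for check in rule_checks if check(response)) / len(rule_checks)
    else:
        rule_score = 1.0  # no hard constraints to violate
    # Assumption: equal weighting of the two verification signals.
    return 0.5 * rule_score + 0.5 * llm_judge(response)

# Example hard constraints for an instruction such as
# "answer in at most five words and end with a period".
def at_most_five_words(text: str) -> bool:
    return len(text.split()) <= 5

def ends_with_period(text: str) -> bool:
    return text.strip().endswith(".")
```

Constraints that code can check exactly (length, format, keywords) go to `rule_checks`, and the judge is left with those that need reading comprehension (tone, style, content).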

Title: Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Authors: Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09944
Pdf URL: https://arxiv.org/pdf/2506.09944
Copy Paste: [[2506.09944]] Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking(https://arxiv.org/abs/2506.09944)
Keywords: language model, gpt, llm, long context
Abstract: Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QR-RETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QR-RETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
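Note: the abstract above uses accumulated attention mass from the query as a retrieval score. The sketch below illustrates that idea on plain nested lists. The data layout (`attention[h][q][t]`), the passage spans, and the function names are assumptions for illustration, not the paper's implementation.

```python
def passage_scores(attention, spans, heads):
    """Sum attention mass from query tokens onto each passage.

    attention[h][q][t]: weight from query token q to context token t in head h.
    spans: list of (start, end) token ranges, end exclusive, one per passage.
    heads: indices of the selected heads (assumed to be chosen beforehand).
    """
    scores = []
    for start, end in spans:
        mass = 0.0
        for h in heads:
            for query_row in attention[h]:
                mass += sum(query_row[start:end])
        scores.append(mass)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring passages, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Passing only the selected heads in `heads` is what separates this from averaging over all attention heads, which would mix in heads that do not track the query.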

Title: Resa: Transparent Reasoning Models via SAEs

Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09967
Pdf URL: https://arxiv.org/pdf/2506.09967
Copy Paste: [[2506.09967]] Resa: Transparent Reasoning Models via SAEs(https://arxiv.org/abs/2506.09967)
Keywords: language model
Abstract: How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

Title: When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text

Authors: Hillary Dawkins, Kathleen C. Fraser, Svetlana Kiritchenko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09975
Pdf URL: https://arxiv.org/pdf/2506.09975
Copy Paste: [[2506.09975]] When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text(https://arxiv.org/abs/2506.09975)
Keywords: llm
Abstract: Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.

Title: Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

Authors: Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09983
Pdf URL: https://arxiv.org/pdf/2506.09983
Copy Paste: [[2506.09983]] Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs(https://arxiv.org/abs/2506.09983)
Keywords: language model, llm, hallucination, prompt
Abstract: Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, in which universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
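Note: the abstract above relies on a simplified CoNLL-U-like tabular output. The sketch below parses such a table and checks that it forms a structurally valid dependency tree. The five tab-separated columns (ID, FORM, UPOS, HEAD, DEPREL) are an assumption, since the abstract does not list the exact columns, and the function names are hypothetical.

```python
def parse_rows(text):
    """Parse tab-separated rows: ID, FORM, UPOS, HEAD, DEPREL (assumed layout)."""
    rows = []
    for line in text.strip().splitlines():
        idx, form, upos, head, deprel = line.split("\t")
        rows.append({"id": int(idx), "form": form, "upos": upos,
                     "head": int(head), "deprel": deprel})
    return rows

def is_valid_tree(rows):
    """Check consecutive IDs, exactly one root (HEAD 0), in-range heads, no cycles."""
    n = len(rows)
    if [r["id"] for r in rows] != list(range(1, n + 1)):
        return False
    heads = {r["id"]: r["head"] for r in rows}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    if any(h < 0 or h > n for h in heads.values()):
        return False
    for start in heads:
        seen = set()
        node = start
        while node != 0:  # walk up towards the root
            if node in seen:
                return False  # revisiting a node means a cycle
            seen.add(node)
            node = heads[node]
    return True
```

A check like this separates structurally invalid output (missing root, out-of-range head, cycle) from output that is well-formed but mislabeled, which only a comparison against gold annotations can catch.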

Title: Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages

Authors: Amel Muminovic, Amela Kadric Muminovic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.09992
Pdf URL: https://arxiv.org/pdf/2506.09992
Copy Paste: [[2506.09992]] Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages(https://arxiv.org/abs/2506.09992)
Keywords: language model, gpt, prompt
Abstract: Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
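Note: the abstract above reports precision, recall, F1 score, accuracy and false positive rate. The sketch below computes those five metrics from binary labels using their standard definitions. The label convention (1 = toxic, 0 = non-toxic) and the function name `binary_metrics` are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Standard binary classification metrics; 1 = toxic, 0 = non-toxic (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "false_positive_rate": false_positive_rate}
```

Reporting the false positive rate alongside recall exposes the trade-off the abstract describes, where added context raises recall but sometimes flags more benign comments.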

Title: From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Authors: Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.09996
Pdf URL: https://arxiv.org/pdf/2506.09996
Copy Paste: [[2506.09996]] From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring(https://arxiv.org/abs/2506.09996)
Keywords: language model, llm, prompt
Abstract: Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation step as an external safety guardrail in real-world products. Existing moderators mainly perform conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of an LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
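Note: the abstract above describes partial detection, where a monitor follows the output stream and stops generation early. The sketch below shows that control flow only. The `harm_score` callable stands in for a trained monitor, and the threshold of 0.5 and the function name are assumptions for illustration.

```python
def stream_with_monitor(tokens, harm_score, threshold=0.5):
    """Release tokens one by one, stopping before the first flagged prefix.

    tokens:     iterable of generated tokens (e.g. from an LLM stream).
    harm_score: hypothetical callable mapping a token prefix to a score in [0, 1].
    Returns (released_tokens, stopped_early).
    """
    released = []
    for token in tokens:
        candidate = released + [token]
        if harm_score(candidate) >= threshold:
            return released, True  # withhold the token that triggered the flag
        released.append(token)
    return released, False

# Toy scorer for demonstration only: flags any prefix containing "BAD".
def toy_scorer(prefix):
    return 1.0 if "BAD" in prefix else 0.0
```

Scoring the candidate prefix before releasing the token means the flagged token never reaches the user, at the cost of one monitor call per generated token.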