2025-06-13

Title: TaskCraft: Automated Generation of Agentic Tasks

Authors: Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Yang, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10055
Pdf URL: https://arxiv.org/pdf/2506.10055
Copy Paste: [[2506.10055]] TaskCraft: Automated Generation of Agentic Tasks(https://arxiv.org/abs/2506.10055)
Keywords: prompt, agent
Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.

Title: A quantum semantic framework for natural language processing

Authors: Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar
Subjects: cs.CL, cs.AI, cs.IR, cs.IT
Abstract URL: https://arxiv.org/abs/2506.10077
Pdf URL: https://arxiv.org/pdf/2506.10077
Copy Paste: [[2506.10077]] A quantum semantic framework for natural language processing(https://arxiv.org/abs/2506.10077)
Keywords: language model, llm, agent
Abstract: Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. Large Language Models (LLMs) and other modern NLP systems face inherent limitations precisely because they operate within natural language itself, making them subject to the same interpretive constraints imposed by semantic degeneracy. In this work, we argue using Kolmogorov complexity that as an expression's complexity grows, the likelihood of any interpreting agent (human or LLM-powered AI) recovering the single intended meaning vanishes. This computational intractability suggests the classical view that linguistic forms possess meaning in and of themselves is flawed. We alternatively posit that meaning is instead actualized through an observer-dependent interpretive act. To test this, we conducted a semantic Bell inequality test using diverse LLM agents as "computational cognitive systems" to interpret ambiguous word pairs under varied contextual settings. Across several independent experiments, we found average CHSH expectation values ranging from 1.2 to 2.8, with several runs yielding values (e.g., 2.3-2.4) that significantly violate the classical boundary ($|S| \leq 2$). This demonstrates that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
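
For readers unfamiliar with the CHSH statistic mentioned in the abstract above, the following is a minimal sketch of how such a value is computed from paired +1/-1 outcomes. It uses the standard textbook combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b'); the function names and the toy outcome data are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean

def correlation(pairs):
    """Mean product of paired outcomes; each outcome is +1 or -1."""
    return mean(a * b for a, b in pairs)

def chsh(e_ab, e_ab2, e_a2b, e_a2b2):
    """Standard CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return e_ab - e_ab2 + e_a2b + e_a2b2

def violates_classical_bound(s):
    """True when |S| exceeds the classical limit of 2."""
    return abs(s) > 2

# Hypothetical toy data: both interpreters always answer +1 under every
# pair of settings, a deterministic (classical) strategy.
always_plus = [(1, 1)] * 4
e = correlation(always_plus)
s_classical = chsh(e, e, e, e)  # 1 - 1 + 1 + 1 = 2, exactly at the bound

print(s_classical, violates_classical_bound(s_classical))
print(violates_classical_bound(2.3))  # a value in the range the abstract reports
```

A deterministic strategy like the toy one can reach but not exceed 2, which is why the reported values of 2.3 to 2.4 are described as violations.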

Title: Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Authors: Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10086
Pdf URL: https://arxiv.org/pdf/2506.10086
Copy Paste: [[2506.10086]] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information(https://arxiv.org/abs/2506.10086)
Keywords: language model, llm, chat, agent
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.

Title: When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs

Authors: Xiao Li, Joel Kreuzwieser, Alan Peters
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10095
Pdf URL: https://arxiv.org/pdf/2506.10095
Copy Paste: [[2506.10095]] When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs(https://arxiv.org/abs/2506.10095)
Keywords: language model, llm, prompt
Abstract: We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.

Title: ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering

Authors: Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang, Bihui Yu, Junnan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10116
Pdf URL: https://arxiv.org/pdf/2506.10116
Copy Paste: [[2506.10116]] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering(https://arxiv.org/abs/2506.10116)
Keywords: language model, gpt
Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.

Title: Unsupervised Elicitation of Language Models

Authors: Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10139
Pdf URL: https://arxiv.org/pdf/2506.10139
Copy Paste: [[2506.10139]] Unsupervised Elicitation of Language Models(https://arxiv.org/abs/2506.10139)
Keywords: language model
Abstract: To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

Title: When Large Language Models are Reliable for Judging Empathic Communication

Authors: Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Erina Farrell, Bruce Lambert, Matthew Groh
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.10150
Pdf URL: https://arxiv.org/pdf/2506.10150
Copy Paste: [[2506.10150]] When Large Language Models are Reliable for Judging Empathic Communication(https://arxiv.org/abs/2506.10150)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at generating empathic responses in text-based conversations. But, how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks' sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.

Title: Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective

Authors: Yi Wang, Max Kreminski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10161
Pdf URL: https://arxiv.org/pdf/2506.10161
Copy Paste: [[2506.10161]] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective(https://arxiv.org/abs/2506.10161)
Keywords: language model, gpt, llm
Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs' ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs' story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.

Title: Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Authors: Shubhashis Roy Dipta, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10202
Pdf URL: https://arxiv.org/pdf/2506.10202
Copy Paste: [[2506.10202]] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval(https://arxiv.org/abs/2506.10202)
Keywords: language model, llm
Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.

Title: Classifying Unreliable Narrators with Large Language Models

Authors: Anneliese Brei, Katharine Henry, Abhisheik Sharma, Shashank Srivastava, Snigdha Chaturvedi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10231
Pdf URL: https://arxiv.org/pdf/2506.10231
Copy Paste: [[2506.10231]] Classifying Unreliable Narrators with Large Language Models(https://arxiv.org/abs/2506.10231)
Keywords: language model, llm
Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.

Title: ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese

Authors: Iago Alves Brito, Julia Soares Dollis, Fernanda Bufon Färber, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10245
Pdf URL: https://arxiv.org/pdf/2506.10245
Copy Paste: [[2506.10245]] ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese(https://arxiv.org/abs/2506.10245)
Keywords: llm
Abstract: We present ToxSyn-PT, the first large-scale Portuguese corpus that enables fine-grained hate-speech classification across nine legally protected minority groups. The dataset contains 53,274 synthetic sentences equally distributed between minority groups and toxicity labels. ToxSyn-PT is created through a novel four-stage pipeline: (1) a compact, manually curated seed; (2) few-shot expansion with an instruction-tuned LLM; (3) paraphrase-based augmentation; and (4) enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The resulting corpus is class-balanced, stylistically diverse, and free from the social-media domain that dominates existing Portuguese datasets. Despite domain differences with traditional benchmarks, experiments on both binary and multi-label classification on the corpus yield strong results across five public Portuguese hate-speech datasets, demonstrating robust generalization even across domain boundaries. The dataset is publicly released to advance research on synthetic data and hate-speech detection in low-resource settings.

Title: Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models

Authors: Andrea Yaoyun Cui, Pengfei Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10268
Pdf URL: https://arxiv.org/pdf/2506.10268
Copy Paste: [[2506.10268]] Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models(https://arxiv.org/abs/2506.10268)
Keywords: language model
Abstract: Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a "false prior." To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights in understanding decision making of large language models.

Title: ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs

Authors: Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10288
Pdf URL: https://arxiv.org/pdf/2506.10288
Copy Paste: [[2506.10288]] ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs(https://arxiv.org/abs/2506.10288)
Keywords: language model, llm
Abstract: Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.
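
To make the bandit framing in the abstract above concrete, here is a minimal sketch that treats each cluster as an arm and spends a fixed draw budget with the textbook UCB1 rule. This is not the paper's modified UCB algorithm, whose details the abstract does not give. The cluster influence scores, the exploration constant `c`, and the function names are all hypothetical.

```python
import math

def ucb_select(counts, means, t, c=1.0):
    """Return the index of the cluster with the highest upper confidence bound."""
    for arm, n in enumerate(counts):
        if n == 0:  # cold start: draw from every cluster once before scoring
            return arm
    scores = [m + c * math.sqrt(2 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def allocate(cluster_scores, budget, c=1.0):
    """Spend `budget` draws across clusters and return draws per cluster.

    cluster_scores[i] holds hypothetical per-sample influence scores for
    cluster i. Draws walk through each list in order (cycling) so the
    example stays deterministic.
    """
    counts = [0] * len(cluster_scores)
    means = [0.0] * len(cluster_scores)
    for t in range(1, budget + 1):
        arm = ucb_select(counts, means, t, c)
        pool = cluster_scores[arm]
        reward = pool[counts[arm] % len(pool)]
        counts[arm] += 1
        # Incremental update of the running mean influence for this cluster.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts

# Three hypothetical clusters; the middle one has the most influential samples.
counts = allocate([[0.1, 0.2], [0.8, 0.9], [0.3, 0.3]], budget=60)
print(counts)
```

In this toy run most of the budget ends up on the high-influence cluster, while the other clusters still receive a few exploratory draws.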

Title: Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Authors: Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10292
Pdf URL: https://arxiv.org/pdf/2506.10292
Copy Paste: [[2506.10292]] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages(https://arxiv.org/abs/2506.10292)
Keywords: language model
Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick's efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.

Title: "Check My Work?": Measuring Sycophancy in a Simulated Educational Context

Authors: Chuck Arvin
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.10297
Pdf URL: https://arxiv.org/pdf/2506.10297
Copy Paste: [[2506.10297]] "Check My Work?": Measuring Sycophancy in a Simulated Educational Context(https://arxiv.org/abs/2506.10297)
Keywords: language model, gpt, llm
Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs "flip" their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.
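
The two quantities this abstract relies on, accuracy change in percentage points and how often a model "flips" to the answer a student mentioned, can be illustrated with a small sketch. The answer lists below are invented toy data, and the function names are assumptions; the paper's actual evaluation code is not described in the abstract.

```python
def accuracy(answers, gold):
    """Fraction of answers that match the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def flip_to_suggestion_rate(baseline, prompted, suggested):
    """Share of items where the answer changed AND now equals the suggestion."""
    flips = sum(b != p and p == s
                for b, p, s in zip(baseline, prompted, suggested))
    return flips / len(baseline)

# Hypothetical multiple-choice items.
gold      = ["A", "B", "C", "D"]
baseline  = ["A", "B", "C", "A"]   # model answers with no student suggestion
suggested = ["B", "C", "D", "B"]   # student mentions an incorrect answer each time
prompted  = ["B", "B", "C", "A"]   # model answers after seeing the suggestion

# Accuracy change expressed in percentage points, as in the abstract.
delta_pp = (accuracy(prompted, gold) - accuracy(baseline, gold)) * 100
flip_rate = flip_to_suggestion_rate(baseline, prompted, suggested)
print(delta_pp, flip_rate)
```

Separating the flip rate from the raw accuracy change helps distinguish answers that moved toward the suggestion from answers that changed for unrelated reasons.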

Title: Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.10299
Pdf URL: https://arxiv.org/pdf/2506.10299
Copy Paste: [[2506.10299]] Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs(https://arxiv.org/abs/2506.10299)
Keywords: language model, llm
Abstract: Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which makes it challenging to adapt them to the speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech-text training in this study. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves translation performance, especially for languages with limited training data.
摘要：语音到语音翻译（S2ST）已使用大型语言模型（LLMS）进行了提前，这些模型在离散的语音单元上进行了微调。在这种方法中，从文本到语音的模态改编一直是一个问题。 LLM受到仅文本数据的培训，该数据提出了挑战，可以使它们适应语音模式，而语音到语音数据有限。为了解决培训难度，我们提出了预定的交织语音 - 本研究中的文本培训。我们使用交织的语音 - 文本单元而不是训练期间的语音单元，在培训中，一致的文本令牌在单词层面上交错。随着培训的进行，我们逐渐降低文本的比例，以促进从文本到语音的渐进式形态适应。我们通过对CVSS数据集的S2ST进行微调Llama3.2-1b进行实验评估。我们表明，所提出的方法始终提高翻译性能，尤其是对于培训数据有限的语言。
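The idea of a decaying text ratio with word-level interleaving can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the linear schedule, the choice to let a word's text token replace its speech units, and the unit names (`u12`, `u7`, `u3`) are all hypothetical.

```python
import random


def text_ratio(step, total_steps, start=1.0, end=0.0):
    """Assumed linear schedule: the text ratio decays from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def interleave(words, units_per_word, ratio, rng):
    """For each word, keep its text token with probability `ratio`;
    otherwise emit the speech units aligned to that word (assumption)."""
    sequence = []
    for word, units in zip(words, units_per_word):
        if rng.random() < ratio:
            sequence.append(word)
        else:
            sequence.extend(units)
    return sequence


words = ["hello", "world"]
units_per_word = [["u12", "u7"], ["u3"]]  # hypothetical discrete speech units
rng = random.Random(0)

early = interleave(words, units_per_word, text_ratio(0, 10), rng)   # text only
late = interleave(words, units_per_word, text_ratio(10, 10), rng)   # speech only
```

Early in training the sequence is mostly text, which is closest to what the LLM saw in pretraining; by the end it consists of speech units only.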

Title: Code Execution as Grounded Supervision for LLM Reasoning

Authors: Dongwon Jung, Wenxuan Zhou, Muhao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10343
Pdf URL: https://arxiv.org/pdf/2506.10343
Copy Paste: [[2506.10343]] Code Execution as Grounded Supervision for LLM Reasoning(https://arxiv.org/abs/2506.10343)
Keywords: language model, llm, chain-of-thought
Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
摘要：通过培训大型语言模型（LLMS）通过经过思考链（COT）监督的培训已被证明有效地增强了推理能力。但是，获得可靠，准确的推理监督仍然是一个重大挑战。我们提出了一种可扩展的方法，用于通过利用程序执行的确定性来生成高质量的COT监督数据集。与现有的推理数据集生成方法依赖于昂贵的人类注释或容易出错的LLM生成的COT不同，我们的方法从代码执行中提取了可验证的，逐步推理的痕迹，并将其转换为自然语言COT推理。关于跨各个领域的推理基准测试的实验表明，我们的方法有效地使LLM具有跨不同任务的可转移推理能力。此外，消融研究验证了我们的方法会产生高度准确的推理数据，并通过减少毫无意义的重复和过度思考来降低推断期间的总体令牌长度。
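To make "reasoning traces from code execution" concrete, the sketch below records the local variables of a function before each executed line using Python's `sys.settrace`, then renders the trace with a fixed template. It is an illustration only: the paper's pipeline is not described at this level in the abstract, and the template in `narrate` is a hypothetical stand-in for whatever converts traces into natural language.

```python
import sys


def trace_execution(func, *args):
    """Run `func` and record (relative line, locals) before each executed line."""
    steps = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None  # ignore frames other than the traced function
        if event == "line":
            offset = frame.f_lineno - func.__code__.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        elif event == "return":
            steps.append(("return", arg))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, steps


def narrate(steps):
    """Hypothetical template that turns a trace into plain sentences."""
    lines = []
    for where, state in steps:
        if where == "return":
            lines.append(f"The function returns {state}.")
        else:
            lines.append(f"Before relative line {where}, the variables are {state}.")
    return lines


def running_sum(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


result, steps = trace_execution(running_sum, 3)
explanation = narrate(steps)
```

Because every step comes from actual execution, each intermediate value in the trace is correct by construction, which is the property the abstract relies on.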

Title: TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

Authors: Xiaohan Yu, Pu Jian, Chong Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.10380
Pdf URL: https://arxiv.org/pdf/2506.10380
Copy Paste: [[2506.10380]] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning(https://arxiv.org/abs/2506.10380)
Keywords: llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state of the art for heterogeneous document question answering. We release TableRAG at this https URL.
摘要：检索型发电（RAG）在开放域问答中表现出了相当大的有效性。但是，当应用于构成文本和表格组件的异质文档时，现有的RAG方法将显示出关键的局限性。扁平表和分块策略的普遍做法破坏了内在的表格结构，导致信息丢失，并破坏了LLM在多跳，全球查询中的推理能力。为了应对这些挑战，我们提出了Tablerag，这是一个混合框架，统一了文本理解和对表格数据的复杂操作。 tablerag迭代分为四个步骤：上下文敏感的查询分解，文本检索，SQL编程和执行以及组成中间答案生成。我们还开发了Heteqa，这是一种新颖的基准测试，旨在评估多跳的异质推理能力。实验结果表明，TableAg始终在公共数据集和我们的Heteqa上均表现出现有的基准，从而建立了一个新的用于异质文档问题的最先进的问题。我们在此HTTPS URL上释放tablerag。
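The "SQL programming and execution" step can be illustrated with a small example of why a global query is easier to answer over a real table than over flattened text chunks. This sketch uses Python's standard `sqlite3` module; the `sales` table, its columns, and the query are hypothetical and are not taken from TableRAG or HeteQA.

```python
import sqlite3

# Hypothetical table that would otherwise be flattened into text chunks.
rows = [
    ("North", 2023, 120.0),
    ("North", 2024, 150.0),
    ("South", 2023, 90.0),
    ("South", 2024, 60.0),
]


def run_sql(query):
    """Load the toy table into an in-memory database and execute one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# A global question ("which region earned the most overall?") needs every row,
# so it maps naturally onto an aggregate query rather than chunk retrieval.
query = (
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC LIMIT 1"
)
top_region = run_sql(query)
```

In a full system the query string would be produced by a model from a decomposed sub-question; here it is written by hand to keep the example self-contained.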

Title: PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Authors: Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10406
Pdf URL: https://arxiv.org/pdf/2506.10406
Copy Paste: [[2506.10406]] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier(https://arxiv.org/abs/2506.10406)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
摘要：大型语言模型（LLM）在复杂的推理任务中表现出了令人印象深刻的能力，但他们仍然很难可靠地验证自己的产出的正确性。现有的验证挑战解决方案通常取决于单独的验证器模型或需要多个阶段的自校正训练管道，从而限制可扩展性。在本文中，我们将政策作为生成验证者（PAG）提出，这是一个简单有效的框架，通过在统一的多转变强化学习（RL）范式中在策略和验证者角色之间交替来使LLMs自我纠正。与先前的方法不同，无论模型置信度如何，PAG都会始终产生第二次尝试，介绍了选择性修订机制：该模型仅在其自己的生成验证步骤检测错误时才修改其答案。这种验证的工作流程不仅减轻了模型崩溃，而且共同增强了推理和验证能力。各种推理基准的广泛实验突出了PAG的双重进步：作为一项政策，它提高了直接生成和自我纠正的准确性；作为一个验证者，其自我验证的表现优于自我矛盾。
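The selective revision mechanism, in which the model revises only when its own verification step detects an error, can be sketched as a simple control loop. This shows inference-time control flow only, not the multi-turn RL training; the `generate`, `verify`, and `revise` callables and the toy arithmetic stand-ins are assumptions made for the example.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Generate an answer, then revise only while verification flags an error."""
    answer = generate(question)
    for _ in range(max_turns):
        if verify(question, answer):
            break  # verification passed: keep the answer, skip revision
        answer = revise(question, answer)
    return answer


# Toy stand-ins for one model playing the policy and verifier roles.
revision_log = []


def toy_generate(question):
    return "5"  # deliberately wrong first attempt


def toy_verify(question, answer):
    return answer == "4"


def toy_revise(question, answer):
    revision_log.append(answer)
    return "4"


final_answer = verify_then_revise("What is 2 + 2?", toy_generate, toy_verify, toy_revise)
```

When the first attempt already passes verification, `revise` is never called, which is the difference from approaches that always produce a second attempt.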

Title: Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Authors: Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10415
Pdf URL: https://arxiv.org/pdf/2506.10415
Copy Paste: [[2506.10415]] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?(https://arxiv.org/abs/2506.10415)
Keywords: language model, llm
Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at this https URL.
摘要：本文介绍了TEMPVS基准，该基准的重点是图像序列中多模式大语言模型（MLLM）的时间基础和推理能力。 TEMPV由三个主要测试（即事件关系推理，句子排序和图像顺序）组成，每个测试都有基本的接地测试。 TEMPV要求MLLM同时依靠视觉和语言方式来了解事件的时间顺序。我们评估了38个最先进的MLLM，表明模型难以解决tempv，与人类能力相比，性能差距很大。我们还提供细粒度的见解，为未来的研究提出了有希望的方向。我们的TEMPVS基准数据和代码可在此HTTPS URL上找到。

Title: Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting

Authors: Avneet Kaur, Arnav Arora
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10421
Pdf URL: https://arxiv.org/pdf/2506.10421
Copy Paste: [[2506.10421]] Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting(https://arxiv.org/abs/2506.10421)
Keywords: language model
Abstract: Framing used by news media, especially in times of conflict, can have a substantial impact on readers' opinions, potentially aggravating the conflict itself. Current studies on conflict framing offer limited insights because they are qualitative in nature or look only at surface-level generic frames. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches that combine frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a greater focus on war-based reporting than on peace-based reporting. We also show substantial differences in reporting across US, UK, and Middle Eastern news outlets in framing who the assailants and victims of the conflict are, surfacing biases within the media.
摘要：新闻媒体使用的框架，尤其是在冲突时期，可能会对读者的看法产生重大影响，并可能加剧冲突本身。当前关于冲突框架主题的研究由于其定性性质而具有有限的见解，或者仅查看表面级别的通用框架而不会更深入。在这项工作中，我们确定了冲突研究中的先前工作的战争与和平新闻的指标，在报道以色列 - 巴勒斯坦战争的新闻文章中。为了进行分析，我们使用计算方法，结合了框架语义和大型语言模型的组合来识别交流框架及其与语言框架的联系。我们的分析表明，更高的关注对基于战争的报告而不是基于和平的报告。我们还在整个美国，英国和中东新闻媒体上的报告中表现出了很大的差异，这是冲突的袭击者和受害者是媒体中的偏见。

Title: Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

Authors: Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10446
Pdf URL: https://arxiv.org/pdf/2506.10446
Copy Paste: [[2506.10446]] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty(https://arxiv.org/abs/2506.10446)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem's complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones to maintain accuracy, thus improving the model's overall performance. Specifically, we manage the model's reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.
摘要：大型语言模型（LLMS）在推理能力方面表现出重大进步，在各种具有挑战性的基准上表现良好。已经引入了诸如经过思想链的促进链的技术，以进一步改善推理。但是，这些方法通常会产生更长的输出，从而增加计算潜伏期。尽管某些方法使用强化学习来缩短推理，但它们通常在不考虑问题的复杂性的情况下采取统一的惩罚，从而导致次优结果。在这项研究中，我们试图通过提高简单问题的简洁性来提高LLM推理的效率，同时保留足够的推理以使其准确性更复杂，从而提高模型的整体性能。具体来说，我们通过分配奖励功能并包括对产出长度的新惩罚来管理模型的推理效率。我们的方法在三个数据集的基准评估中产生了令人印象深刻的结果：GSM8K，MATH500和AIME2024。对于相对简单的数据集GSM8K和Math500，我们的方法有效地缩短了输出长度，同时保持或提高了准确性。关于越来越苛刻的AIME2024数据集，我们的方法提高了准确性。

Title: Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Authors: Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10486
Pdf URL: https://arxiv.org/pdf/2506.10486
Copy Paste: [[2506.10486]] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers(https://arxiv.org/abs/2506.10486)
Keywords: llm
Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
摘要：针对表的科学索赔验证通常需要预测给定表是否支持或驳斥索赔。但是，我们认为仅预测最终标签是不够的：它几乎没有揭示模型的推理，并且提供了有限的解释性。为了解决这个问题，我们将表格对准作为解释任务进行了重新审议，要求模型来识别索赔验证必不可少的表单元格。我们通过使用人类注销的细胞级理由扩展Scitab基准来构建一个新的数据集。注释者验证索赔标签并突出显示支持其决策所需的最小单元格集。在注释过程之后，我们利用收集的信息并提出分类法来处理模棱两可的案件。我们的实验表明，（i）合并表对齐信息改善了索赔验证绩效，（ii）大多数LLM在经常预测正确的标签的同时无法恢复人类对准的理由，这表明它们的预测并非源于忠实的推理。
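One plausible way to compare a model's selected cells against the human-annotated minimal cell set is a set-overlap score. The sketch below computes precision, recall, and F1 over `(row, column)` pairs. The abstract does not name the metric used, so this choice and the cell coordinates are assumptions for illustration.

```python
def cell_f1(predicted, gold):
    """Precision, recall, and F1 between two sets of (row, col) table cells."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: the model highlights one cell the annotators did not.
gold_cells = {(0, 1), (2, 3)}
predicted_cells = {(0, 1), (2, 3), (1, 1)}
precision, recall, f1 = cell_f1(predicted_cells, gold_cells)
```

A model can predict the correct claim label while scoring poorly here, which is the gap between label accuracy and faithful rationales that the abstract points to.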

Title: Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models

Authors: Aleksandra Sorokovikova, Pavel Chizhov, Iuliia Eremenko, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10491
Pdf URL: https://arxiv.org/pdf/2506.10491
Copy Paste: [[2506.10491]] Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models(https://arxiv.org/abs/2506.10491)
Keywords: language model, llm, prompt
Abstract: Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user's answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.
摘要：现代语言模型经过大量数据培训。这些数据不可避免地包括有争议的和刻板的内容，其中包含与性别，起源，年龄等有关的各种偏见。结果，模型表达了偏见的观点或基于指定的个性或用户的个性产生不同的结果。在本文中，我们研究了大语言模型（LLMS）中偏见的各种代理度量。我们发现，在多主题基准（MMLU）上评估具有预先贡献的人物模型会导致得分可忽略不计，并且大多是随机差异。但是，如果我们重新制定任务并要求模型对用户的答案进行评分，则会显示出更重要的偏见迹象。最后，如果我们向模型寻求工资谈判建议，我们会看到答案中有明显的偏见。随着LLM助理记忆和个性化的最新趋势，这些问题从不同的角度开始：现代LLM用户不需要预先预测其角色的描述，因为该模型已经知道他们的社会人口统计学。

Title: Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Authors: Sangmin Song, Juhwan Choi, JungMin Yun, YoungBin Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10504
Pdf URL: https://arxiv.org/pdf/2506.10504
Copy Paste: [[2506.10504]] Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models(https://arxiv.org/abs/2506.10504)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.
摘要：大型语言模型（LLMS）在零声对话状态跟踪（DST）中表现出了出色的性能，从而减少了对特定于任务的培训的需求。但是，常规的DST基准主要集中于结构化的用户代理对话，未能捕获现实世界多用户交互的复杂性。在这项研究中，我们评估了多用户DST中LLM的鲁棒性，同时最大程度地降低了数据集的构建成本。受基于LLM的数据注释的最新进展的启发，我们通过基于语音ACT理论生成第二个用户的话语来扩展现有的DST数据集。我们的方法系统系统地将第二用户的话语纳入对话中，从而在多用户设置中对LLM进行了受控评估。实验结果表明，与单用户DST相比，性能下降显着下降，强调了当前LLM在提取和跟踪对话中的局限性在多个扬声器中。我们的发现强调了未来的研究需要增强多用户DST方案的LLM，为更现实和强大的DST模型铺平了道路。

Title: Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10508
Pdf URL: https://arxiv.org/pdf/2506.10508
Copy Paste: [[2506.10508]] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs(https://arxiv.org/abs/2506.10508)
Keywords: language model, llm
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as factual knowledge itself. Despite the potential of such paths, extracting them reliably from KGs poses two challenges: the complexity of graph structures and the existence of multiple generated paths, which make it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.
摘要：大型语言模型（LLM）通常由于缺乏背景知识和幻觉趋势而在知识密集型任务中挣扎。为了解决这些局限性，已深入研究了将知识图（KGS）与LLMS集成在一起。现有的KG增强LLM专注于补充事实知识，但仍在解决复杂问题方面挣扎。我们认为，完善事实之间的关系并将其组织成逻辑上一致的推理路径同样重要，就像事实知识本身一样重要。尽管它们具有潜力，但从kgs提取可靠的推理路径仍带来了以下挑战：图形结构的复杂性和多个生成的路径的存在，因此很难区分有用的和冗余的路径。为了应对这些挑战，我们提出了RRP框架来挖掘知识图，该框架将LLMS的语义优势与通过关系嵌入和双向分布学习获得的结构信息结合在一起。此外，我们引入了一个重新思考模块，该模块根据其重要性评估和完善推理路径。两个公共数据集的实验结果表明，与现有的基线方法相比，RRP实现了最先进的性能。此外，可以轻松地将RRP集成到各种LLM中，以以插件的方式增强其推理能力。通过生成针对特定问题量身定制的高质量推理路径，RRP蒸发了LLM推理的有效指导。

Title: SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

Authors: Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10622
Pdf URL: https://arxiv.org/pdf/2506.10622
Copy Paste: [[2506.10622]] SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis(https://arxiv.org/abs/2506.10622)
Keywords: language model, llm, agent
Abstract: The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.
摘要：会话AI系统的进步取决于用于培训，评估和基准测试的高质量，灵活和可重复的合成对话的可用性。 Sdialog是一个模块化的，可扩展的Python工具包，旨在应对合成对话生成和分析的挑战。通过利用教学调整的大语言模型（LLM），Sdialog为角色，编排和场景管理提供了抽象，从而为研究和开发创建了现实，多样化和可控制的对话数据。 Sdialog支持工作流，例如多代理模拟和场景驱动的生成，并代表了综合数据生成工具和框架的标准化迈出的一步，这是确保当今快速发展的研究景观中可重复性的重要进步。

Title: NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors

Authors: Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10627
Pdf URL: https://arxiv.org/pdf/2506.10627
Copy Paste: [[2506.10627]] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors(https://arxiv.org/abs/2506.10627)
Keywords: language model, gpt, llm, prompt
Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at this https URL.
摘要：本文介绍了我们的曲目1：BEA 2025中的错误识别的系统，该任务是关于AI驱动的导师的教学能力评估。该任务涉及评估导师的响应是否正确地识别学生的数学推理中的错误。我们探讨了四种方法：（1）从多个审慎的语言模型（LMS）中汇总的令牌嵌入的机器学习模型集合；（2）使用MLP分类器的[CLS]嵌入式嵌入式嵌入式句子转换器；（3）在象征级的历史和响应嵌入之间具有多头关注的历史感知模型；（4）带有大语言模型（LLM）的检索效果促进系统，即GPT 4O。我们的最终系统检索了语义上相似的示例，结构化提示，并使用模式指导的输出解析来产生可解释的预测。它表现优于所有基准，证明了将示例驱动的提示与LLM推理进行教学反馈评估的有效性。我们的代码可在此HTTPS URL上找到。
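The retrieve-then-prompt pattern behind the final system can be sketched in a few lines. To stay self-contained, this example swaps in bag-of-words cosine similarity where the described system retrieves semantically similar examples; the example pool, labels, and prompt wording are hypothetical, and no LLM call is made.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector (a stand-in for a semantic embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, pool, k=2):
    """Return the k labelled examples most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["text"])), reverse=True)[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt from the retrieved examples."""
    lines = ["Does the tutor response identify the student's mistake? Answer Yes or No.", ""]
    for ex in examples:
        lines += [f"Response: {ex['text']}", f"Label: {ex['label']}", ""]
    lines += [f"Response: {query}", "Label:"]
    return "\n".join(lines)


pool = [
    {"text": "you forgot to carry the one in the addition", "label": "Yes"},
    {"text": "great job keep going", "label": "No"},
    {"text": "check the addition again you forgot to carry", "label": "Yes"},
]
query = "i think you forgot to carry in the addition step"
shots = retrieve(query, pool, k=2)
prompt = build_prompt(query, shots)
```

The prompt ends at `Label:` so that the model's completion can be parsed as a single constrained field, in the spirit of the schema-guided output parsing mentioned in the abstract.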

Title: Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters

Authors: Tatsuya Hiraoka, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10641
Pdf URL: https://arxiv.org/pdf/2506.10641
Copy Paste: [[2506.10641]] Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters(https://arxiv.org/abs/2506.10641)
Keywords: language model, llm
Abstract: Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
摘要：大型语言模型（LLMS）可以以高准确性来阐明令牌字符，但他们在更复杂的角色级任务中挣扎，例如识别代币中的组成子组件。在这项工作中，我们调查了LLM在拼写过程中如何内部代表和利用字符级信息。我们的分析表明，尽管对人类来说，拼写是一项简单的任务，但LLM并非以直接的方式处理它。具体而言，我们表明嵌入层没有完全编码字符级信息，尤其是在第一个字符之外。结果，LLMS依靠中间和较高的变压器层来重建角色级别的知识，我们在其拼写行为中观察到了独特的“突破”。我们通过三个互补分析来验证这种机制：探测分类器，知识神经元的识别和注意力重量的检查。

Title: Large Language Models for Detection of Life-Threatening Texts

Authors: Thanh Thi Nguyen, Campbell Wilson, Janis Dalins
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10687
Pdf URL: https://arxiv.org/pdf/2506.10687
Copy Paste: [[2506.10687]] Large Language Models for Detection of Life-Threatening Texts(https://arxiv.org/abs/2506.10687)
Keywords: language model, llm
Abstract: Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs including Gemma, Mistral, and Llama-2 using their 7B parameter variants on different datasets, which are constructed with class balance, imbalance, and extreme imbalance scenarios. Experimental results demonstrate a strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates a great potential of LLMs for real-world life-threatening language detection problems.
摘要：检测威胁生命的语言对于维护遇险中的个人，促进心理健康和福祉以及防止潜在的伤害和生命丧失至关重要。本文提出了一种使用大语言模型（LLM）识别威胁生命的文本的有效方法，并将其与传统方法（例如单词袋，单词嵌入，主题建模和双向编码器的代表）进行比较。我们使用其7B参数变体在不同数据集上的7B参数变体中微调了三个开源LLM，包括Gemma，Mismtral和Llama-2，这些变体是由阶级平衡，不平衡和极端不平衡方案构建的。实验结果表明，LLM在传统方法上的强劲表现。更具体地说，Mistral和Llama-2模型在平衡和不平衡的数据方案中都是表现最佳的人，而Gemma略有落后。我们采用了UPS采样技术来处理不平衡的数据方案，并证明，尽管这种方法受益于传统方法，但对LLM的影响却没有太大影响。这项研究证明了LLM在现实世界中威胁生命的语言检测问题的巨大潜力。
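The upsampling step for imbalanced data can be sketched as random oversampling of the smaller classes until every class matches the largest one. This is a generic illustration using only the standard library; the abstract does not specify the exact procedure, and the toy texts and labels are hypothetical.

```python
import random
from collections import Counter


def upsample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((sample, label) for sample in group + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced training split: one positive example, four negative.
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = [1, 0, 0, 0, 0]
balanced = upsample(texts, labels)
class_counts = Counter(label for _, label in balanced)
```

Upsampling should be applied to the training split only, so that evaluation still reflects the original class imbalance.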

Title: Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet

Authors: Lorenzo Augello, John P. McCrae
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10715
Pdf URL: https://arxiv.org/pdf/2506.10715
Copy Paste: [[2506.10715]] Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet(https://arxiv.org/abs/2506.10715)
Keywords: language model
Abstract: Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.
摘要：Open English WordNet是在Ontolex-Lemon上发布的关键资源，作为语言链接的开放数据云的一部分。但是，资源中存在许多链接，在本文中，我们研究了如何在形容词之间建立超伴侣。我们提出了关于超声关系的理论讨论及其与名词和动词形成鲜明对比的形容词的区别。我们开发了一种新的资源，用于形容词超伴侣和微调大语模型，以预测形容词性超伴侣，这表明可以将Taxollama的方法适应此任务。

Title: PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models

Authors: Ye Yu, Yaoning Yu, Haohan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10716
Pdf URL: https://arxiv.org/pdf/2506.10716
Copy Paste: [[2506.10716]] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models(https://arxiv.org/abs/2506.10716)
Keywords: llm, prompt, chain-of-thought
Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy (96% → 96% with Claude, 91% → 92% with Gemini) while reducing reasoning tokens by up to 87.5% and cutting dollar cost by 69-82%. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.
摘要：大型推理模型（LRMS），例如Claude 3.7十四行诗和OpenAI O1，使用冗长的三链（COT）推理在数学基准上实现了强劲的性能，但是所产生的痕迹通常是不必要的冗长的。这会膨胀令牌的使用和成本，从而限制了对潜伏期敏感或API约束的设置的部署。我们介绍了前提（基于及时的有效数学推断和战略评估），这是一个迅速的框架，可在不修改模型权重的情况下减少推理开销。前提将痕量级诊断与梯度启发的及时优化相结合，以最大程度地减少冗余计算，同时保持答案的准确性。该方法通过多目标文本搜索来共同优化简洁性和正确性，该搜索能够平衡令牌长度并回答有效性。与先前的工作不同，前提是单件通行的黑盒接口，因此可以直接应用于商业LLMS。在GSM8K，SVAMP和MATH500上，我们匹配或超过基线准确度（$ 96 \％\ rightArrow96 \％$搭配Claude，$ 91 \％\％\％\ rightarrow92 \％$与Gemini搭配Gemini $），而将推理代价降低到$ 87.5 \％$ $ 87.5 \％，$ 82 $ 82 $ - 82 $ - 822 c。这些结果表明，及时级别的优化是有效LRM推断的实用且可扩展的途径，而不会损害推理质量。

Title: Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims

Authors: Priyanka Kargupta, Runchu Tian, Jiawei Han
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.10728
Pdf URL: https://arxiv.org/pdf/2506.10728
Copy Paste: [[2506.10728]] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims(https://arxiv.org/abs/2506.10728)
Keywords: retrieval-augmented generation
Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely "true" or "false" -- as is frequently the case with scientific and political claims. However, a claim (e.g., "vaccine A is better than vaccine B") can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., "how many biomedical papers believe vaccine A is more transportable than B?"). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.

Title: TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora

Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.10737
Pdf URL: https://arxiv.org/pdf/2506.10737
Copy Paste: [[2506.10737]] TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora(https://arxiv.org/abs/2506.10737)
Keywords: language model, llm
Abstract: The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines, as judged by LLMs.

Title: One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Authors: Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10766
Pdf URL: https://arxiv.org/pdf/2506.10766
Copy Paste: [[2506.10766]] One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers(https://arxiv.org/abs/2506.10766)
Keywords: language model, llm
Abstract: Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

Title: Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs

Authors: Alberto Testoni, Iacer Calixto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10769
Pdf URL: https://arxiv.org/pdf/2506.10769
Copy Paste: [[2506.10769]] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs(https://arxiv.org/abs/2506.10769)
Keywords: language model, llm
Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.
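The Semantic Entropy baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes the sampled answers have already been grouped into meaning clusters; for multiple-choice questions the selected option letter is used here as the cluster label, which is an assumption of this sketch and not necessarily the paper's setup. The function name and the example samples are hypothetical.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    `cluster_labels` holds one label per sampled generation. How the labels
    are obtained (here: the chosen option letter) is an assumption.
    """
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical example: five sampled answers to one multiple-choice question.
samples = ["B", "B", "B", "C", "B"]
entropy = discrete_semantic_entropy(samples)
print(f"semantic entropy: {entropy:.3f} nats")
```

Low entropy means the samples agree on one answer; higher entropy signals uncertainty. The cost is several generations per question, which is what the single-pass estimators in the abstract aim to avoid.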

Title: Improving Named Entity Transcription with Contextual LLM-based Revision

Authors: Viet Anh Trinh, Xinlu He, Jacob Whitehill
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10779
Pdf URL: https://arxiv.org/pdf/2506.10779
Copy Paste: [[2506.10779]] Improving Named Entity Transcription with Contextual LLM-based Revision(https://arxiv.org/abs/2506.10779)
Keywords: language model, llm
Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM's reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
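To make "relative WER reduction" concrete, the short sketch below computes it from a baseline and a revised error rate. The two rates are hypothetical values chosen for illustration; they are not results from the paper.

```python
def relative_reduction(baseline_wer, revised_wer):
    """Relative WER reduction: the fraction of the baseline error removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - revised_wer) / baseline_wer


# Hypothetical named-entity WERs before and after LLM revision.
baseline = 0.20
revised = 0.14
print(f"relative reduction: {relative_reduction(baseline, revised):.0%}")
```

A relative reduction is measured against the baseline error, so it differs from the absolute drop in percentage points (here 6 points).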

Title: Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints

Authors: Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10800
Pdf URL: https://arxiv.org/pdf/2506.10800
Copy Paste: [[2506.10800]] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints(https://arxiv.org/abs/2506.10800)
Keywords: language model, llm
Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at this https URL.
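The projection step described in this abstract can be illustrated with a small vector-level sketch. It assumes the earlier updates are given as plain direction vectors and uses Gram-Schmidt to remove any component of a new update that lies in their span. The actual method operates on model weight updates, so the function names and the toy vectors here are hypothetical and only show the geometric idea.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, previous_directions):
    """Remove from `update` every component in span(previous_directions)."""
    projected = list(update)
    for b in orthonormal_basis(previous_directions):
        coeff = dot(projected, b)
        projected = [pi - coeff * bi for pi, bi in zip(projected, b)]
    return projected


# Toy example: earlier edits span the x-y plane, so only z survives.
previous = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [0.5, -2.0, 3.0]
safe_update = project_to_complement(new_update, previous)
print(safe_update)
```

Because the projected update is orthogonal to every earlier direction, applying it leaves the components along those directions unchanged, which is the sense in which later edits avoid interfering with earlier ones.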

Title: ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Authors: Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10822
Pdf URL: https://arxiv.org/pdf/2506.10822
Copy Paste: [[2506.10822]] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization(https://arxiv.org/abs/2506.10822)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue by curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data, and they are prone to overfitting. To address this challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of the reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs): one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via this https URL.
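The final merging step can be sketched as a plain linear interpolation between two parameter sets. The abstract only says the parameters are interpolated; the single mixing weight, the dictionary layout, and the parameter names below are assumptions made for illustration.

```python
def interpolate_parameters(params_a, params_b, weight):
    """Return weight * params_a + (1 - weight) * params_b, per parameter.

    Both arguments map parameter names to flat lists of floats. A single
    global mixing weight is an assumption of this sketch.
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [weight * a + (1.0 - weight) * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }


# Hypothetical toy "models": one tuned for accuracy, one for short reasoning.
accuracy_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
short_model = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
merged = interpolate_parameters(accuracy_model, short_model, weight=0.5)
print(merged)
```

Moving `weight` toward 1.0 keeps the merged model closer to the accuracy-oriented parameters, and moving it toward 0.0 favors the shorter-reasoning parameters.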

Title: CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Authors: Alireza Salemi, Mukta Maddipatla, Hamed Zamani
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.10844
Pdf URL: https://arxiv.org/pdf/2506.10844
Copy Paste: [[2506.10844]] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training(https://arxiv.org/abs/2506.10844)
Keywords: retrieval augmented generation, retrieval-augmented generation, agent
Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.

Title: Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles

Authors: Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10848
Pdf URL: https://arxiv.org/pdf/2506.10848
Copy Paste: [[2506.10848]] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles(https://arxiv.org/abs/2506.10848)
Keywords: language model, llm
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

Title: Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment

Authors: Hongda Sun, Jiaren Peng, Wenzhong Yang, Liang He, Bo Du, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10877
Pdf URL: https://arxiv.org/pdf/2506.10877
Copy Paste: [[2506.10877]] Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment(https://arxiv.org/abs/2506.10877)
Keywords: prompt
Abstract: Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.

Title: Slimming Down LLMs Without Losing Their Minds

Authors: Qingda (Michael) Mai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10885
Pdf URL: https://arxiv.org/pdf/2506.10885
Copy Paste: [[2506.10885]] Slimming Down LLMs Without Losing Their Minds(https://arxiv.org/abs/2506.10885)
Keywords: language model, llm
Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.
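For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update: the frozen weight W is adjusted by a scaled product B·A of two small matrices, and only A and B are trained. The matrix sizes, values, and helper names are hypothetical, and QLoRA's additional quantization of the frozen base weights is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical 3x3 frozen weight with a rank-1 adapter.
w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
a = [[1.0, 2.0, 3.0]]        # r x d_in
b = [[1.0], [0.0], [2.0]]    # d_out x r
adapted = lora_effective_weight(w, a, b, alpha=2.0)

full_params = len(w) * len(w[0])                       # 9 trainable if fully fine-tuned
lora_params = len(a) * len(a[0]) + len(b) * len(b[0])  # 6 trainable with the adapter
print(adapted, full_params, lora_params)
```

The savings grow with layer size: for a d×d layer the adapter trains 2·r·d values instead of d², which is why the approach suits the limited-resource settings the abstract targets.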

Title: Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10887
Pdf URL: https://arxiv.org/pdf/2506.10887
Copy Paste: [[2506.10887]] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers(https://arxiv.org/abs/2506.10887)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
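The nuclear norm named in this abstract can be written out explicitly. In the block below, the symbols $W_O$, $W_V$, and $W_{OV}$ and the multiplication order are notational assumptions, since the abstract does not fix them. The second line is a standard identity for the nuclear norm; whether the paper's analysis uses this exact route is not stated in the abstract.

```latex
% Combined output-value matrix and its nuclear norm (sum of singular values)
W_{OV} = W_O W_V, \qquad
\|W_{OV}\|_* = \sum_{i=1}^{\operatorname{rank}(W_{OV})} \sigma_i(W_{OV})

% Standard variational form: the nuclear norm is the smallest average
% squared Frobenius norm over all factorizations of the matrix
\|W\|_* = \min_{A,\,B \,:\, AB = W} \tfrac{1}{2}\left(\|A\|_F^2 + \|B\|_F^2\right)
```

The identity shows how a factorized parameterization can tie the size of the two factors to the nuclear norm of their product, which is consistent with the contrast the abstract draws between factorized and combined weights.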

Title: Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Authors: Lan Zhang, Marco Valentino, Andre Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10903
Pdf URL: https://arxiv.org/pdf/2506.10903
Copy Paste: [[2506.10903]] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning(https://arxiv.org/abs/2506.10903)
Keywords: language model, llm
Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.

Title: Magistral

Authors: Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Andy Ehrenberg, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jean-Hadrien Chabran, Jean-Malo Delignon, Joachim Studnia, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Kush Jain, Lingxiao Zhao, Louis Martin, Luyu Gao, Lélio Renard Lavaud, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Maximilian Augustin, Mickaël Seznec, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Romain Sauvestre, Rémi Delacourt, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yunhao Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10910
Pdf URL: https://arxiv.org/pdf/2506.10910
Copy Paste: [[2506.10910]] Magistral(https://arxiv.org/abs/2506.10910)
Keywords: llm
Abstract: We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following, and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0), which further includes cold-start data from Magistral Medium.

Title: Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Authors: Or Shafran, Atticus Geiger, Mor Geva
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10920
Pdf URL: https://arxiv.org/pdf/2506.10920
Copy Paste: [[2506.10920]] Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization(https://arxiv.org/abs/2506.10920)
Keywords: language model, gpt, llm
Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
摘要：机制可解释性的一个核心目标是在大语言模型（LLM）中找到能够因果地解释其输出的合适分析单元。早期工作聚焦于单个神经元，但有证据表明神经元往往编码多个概念，这促使研究转向分析激活空间中的方向。一个关键问题是如何以无监督的方式找到能捕获可解释特征的方向。当前方法依赖于基于稀疏自编码器（SAE）的字典学习，通常在残差流激活上训练，从头学习方向。然而，SAE在因果评估中往往表现不佳，并且缺乏内在可解释性，因为其学习过程并未与模型的计算显式关联。在这里，我们通过使用半非负矩阵分解（SNMF）直接分解MLP激活来解决这些局限，使得学到的特征（a）是共同激活神经元的稀疏线性组合，并且（b）被映射到其激活输入，从而可以直接解释。在Llama 3.1、Gemma 2和GPT-2上的实验表明，SNMF得到的特征在因果引导（steering）上优于SAE和一个强有监督基线（均值差，difference-in-means），同时与人类可解释的概念相一致。进一步的分析表明，特定的神经元组合会在语义相关的特征之间被复用，揭示了MLP激活空间中的层次结构。总之，这些结果表明SNMF是一种简单有效的工具，可用于识别可解释特征并剖析LLM中的概念表示。
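
As a rough illustration of the factorization the abstract describes, the sketch below factors a mixed-sign activation matrix into an unconstrained feature matrix and a nonnegative coefficient matrix. It uses the classic multiplicative updates for semi-NMF, which is an assumption here: the abstract does not state the optimizer, and the sparsity constraint it mentions is omitted. The function name `semi_nmf`, the matrix shapes, and the hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np


def semi_nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor X (d x n, mixed sign) as X ~= F @ G.T.

    F (d x k) is unconstrained and plays the role of feature directions;
    G (n x k) is kept nonnegative and holds per-sample coefficients.
    Assumption: classic semi-NMF multiplicative updates, no sparsity term.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = rng.random((n, k)) + 0.1  # strictly positive initialization

    def pos(A):
        return (np.abs(A) + A) / 2  # elementwise positive part

    def neg(A):
        return (np.abs(A) - A) / 2  # elementwise negative part, as a nonnegative array

    for _ in range(n_iter):
        # Least-squares optimal F for the current G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        XtF = X.T @ F  # shape (n, k)
        FtF = F.T @ F  # shape (k, k)
        # The multiplicative update keeps every entry of G nonnegative.
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF) + eps
        G *= np.sqrt(num / den)

    # Final refit of F for the last G.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    return F, G
```

In this toy setup, each column of `F` would correspond to one candidate feature direction over neurons, and the largest entries in the matching column of `G` point to the inputs that activate it most strongly.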

Title: Dynamic Epistemic Friction in Dialogue

Authors: Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.10934
Pdf URL: https://arxiv.org/pdf/2506.10934
Copy Paste: [[2506.10934]] Dynamic Epistemic Friction in Dialogue(https://arxiv.org/abs/2506.10934)
Keywords: language model, llm, agent
Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of "epistemic friction," or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent's current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.
摘要：将大语言模型（LLM）与人类偏好对齐的最新进展显著提升了其在人机协作场景中的效用。然而，这类方法往往忽视了“认知摩擦”的关键作用，即在面对新的、相互冲突的或模棱两可的信息而更新信念时遇到的固有阻力。在本文中，我们将动态认知摩擦定义为对认知整合的阻力，其特征是智能体当前的信念状态与外部证据支持的新命题之间的不一致。我们将其置于动态认知逻辑（Van Benthem and Pacuit, 2011）的框架内，其中摩擦表现为交互过程中非平凡的信念修正。随后，我们给出了来自一项情境化协作任务的分析，表明这一认知摩擦模型能够有效预测对话中的信念更新，并进一步讨论了如何将作为认知阻力或摩擦度量的信念对齐模型自然地加以完善，以适应现实世界对话场景的复杂性。

Title: Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10952
Pdf URL: https://arxiv.org/pdf/2506.10952
Copy Paste: [[2506.10952]] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training(https://arxiv.org/abs/2506.10952)
Keywords: language model
Abstract: We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
摘要：我们提出Domain2Vec，这是一种新方法，可将任意数据集分解为若干元域（meta-domains）的线性组合；元域是一个新概念，旨在捕获数据集的关键底层特征。Domain2Vec维护一个元域词表，并使用分类器将任意给定数据集分解为一个域向量，该向量对应于该词表上的一个分布。在分布对齐假设（Distribution Alignment Assumption，DA²）下，这些域向量使我们能够以免训练的方式确定语言模型（LM）预训练的最优数据混合；该假设认为，训练集与验证集的数据分布对齐得越好，验证损失就越低。此外，Domain2Vec可以无缝集成到已有工作中，用于建模域向量与LM性能之间的关系，从而大幅提升已有方法的效率和可扩展性。大量实验表明，Domain2Vec有助于以极小的计算开销找到能提升下游任务性能的数据混合。具体而言，Domain2Vec在Pile-CC上达到相同验证损失时，仅需使用The Pile数据集原始混合进行训练所需计算量的51.5%。在同等计算预算下，Domain2Vec使下游性能平均提升2.83%。
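
To make the domain-vector idea concrete, here is a minimal sketch of the two steps the abstract outlines: averaging a classifier's per-sample meta-domain probabilities into a domain vector, then choosing mixture weights so that the mixed training vector lies close to the validation vector. The L1 distance, the grid search over the simplex, and all function names are assumptions made for illustration; the abstract does not specify how alignment is measured or optimized.

```python
from itertools import product
from typing import Iterator, List, Sequence


def domain_vector(sample_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sample meta-domain probabilities into one domain vector."""
    n, dims = len(sample_probs), len(sample_probs[0])
    return [sum(p[d] for p in sample_probs) / n for d in range(dims)]


def mix(vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Domain vector of a weighted mixture of datasets."""
    dims = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dims)]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed misalignment measure between two domain vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def simplex_grid(k: int, steps: int) -> Iterator[List[float]]:
    """Yield weight vectors of length k on a regular grid that sum to 1."""
    for head in product(range(steps + 1), repeat=k - 1):
        rest = steps - sum(head)
        if rest >= 0:
            yield [h / steps for h in head] + [rest / steps]


def best_mixture(train_vectors: Sequence[Sequence[float]],
                 val_vector: Sequence[float],
                 steps: int = 10) -> List[float]:
    """Pick the grid point whose mixed vector is closest to the validation vector."""
    return min(
        simplex_grid(len(train_vectors), steps),
        key=lambda w: l1_distance(mix(train_vectors, w), val_vector),
    )
```

A call such as `best_mixture([vec_a, vec_b], vec_val)` returns one weight per training dataset. No model is trained at any point, which is the sense in which the search is training-free; the grid search itself scales poorly with the number of datasets and is used here only for clarity.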

Title: ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Authors: Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Subjects: cs.CL, cs.AI, cs.CR, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10960
Pdf URL: https://arxiv.org/pdf/2506.10960
Copy Paste: [[2506.10960]] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark(https://arxiv.org/abs/2506.10960)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at this https URL.
摘要：大语言模型（LLM）已越来越多地应用于自动化有害内容检测任务，协助审核人员识别违反政策的内容，并提高内容审核的整体效率和准确性。然而，现有的有害内容检测资源主要集中在英语上，中文数据集仍然稀缺，且覆盖范围往往有限。我们提出了一个全面的、经专业标注的中文有害内容检测基准，它涵盖六个代表性类别，并完全基于真实世界数据构建。我们的标注过程还产出了一个知识规则库，为LLM进行中文有害内容检测提供显式的专家知识。此外，我们提出了一个知识增强基线，它同时整合了人工标注的知识规则和来自大语言模型的隐式知识，使较小的模型能够达到与最先进LLM相当的性能。代码和数据可在此 https URL 获取。

Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Authors: Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.HC, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2506.10974
Pdf URL: https://arxiv.org/pdf/2506.10974
Copy Paste: [[2506.10974]] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science(https://arxiv.org/abs/2506.10974)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.
摘要：大语言模型（LLM）智能体在解决现实世界的数据科学问题方面展现出巨大潜力。LLM驱动的数据科学智能体有望自动化整个机器学习流水线，但其在现实中的有效性仍然有限。现有框架依赖僵化的预定义工作流和不灵活的编码策略；因此，它们只在相对简单的经典问题上表现出色，而无法体现人类从业者在复杂、创新任务中运用的经验性专业知识。在这项工作中，我们提出AutoMind，这是一个自适应、具备知识的LLM智能体框架，通过三项关键改进克服上述不足：（1）一个精心整理的专家知识库，使智能体立足于领域专家知识；（2）一种具备知识的智能体式树搜索算法，可有策略地探索可能的解决方案；（3）一种自适应编码策略，可根据任务复杂度动态调整代码生成。在两个自动化数据科学基准上的评估表明，AutoMind的性能优于最先进的基线。进一步的分析证实了其在有效性、效率和解决方案定性质量上的优势，表明AutoMind是迈向全自动数据科学的高效而稳健的一步。