2025-12-25

Title: Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Authors: Matyas Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20638
Pdf URL: https://arxiv.org/pdf/2512.20638
Copy Paste: [[2512.20638]] Uncovering Competency Gaps in Large Language Models and Their Benchmarks(https://arxiv.org/abs/2512.20638)
Keywords: language model, llm
Abstract: The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model's internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at this https URL.
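The saliency-weighted scoring mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `concept_scores`, the weighting rule (an activation-weighted mean of per-example correctness), and the toy numbers are all assumptions made for illustration.

```python
def concept_scores(activations, correct):
    """Activation-weighted accuracy per concept (illustrative assumption).

    activations: list of rows, one per benchmark example; row[c] is a
                 non-negative SAE concept activation (hypothetical values).
    correct:     list of 0/1 flags, 1 if the model answered that example correctly.
    """
    n_concepts = len(activations[0])
    scores = []
    for c in range(n_concepts):
        total = sum(row[c] for row in activations)
        if total == 0:
            # The concept never fires on this benchmark: a possible benchmark gap.
            scores.append(None)
            continue
        weighted_hits = sum(row[c] * y for row, y in zip(activations, correct))
        scores.append(weighted_hits / total)
    return scores


# Toy data: 3 examples, 3 concepts; the model got only the first example right.
acts = [[1.0, 0.0, 0.0],
        [1.0, 2.0, 0.0],
        [0.0, 2.0, 0.0]]
scores = concept_scores(acts, [1, 0, 0])
```

Under these assumptions, a low score flags a candidate model gap for that concept, while a concept with no activation mass at all is not covered by the benchmark data.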

Title: SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention

Authors: Alexandros Christoforos, Chadbourne Davis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20724
Pdf URL: https://arxiv.org/pdf/2512.20724
Copy Paste: [[2512.20724]] SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention(https://arxiv.org/abs/2512.20724)
Keywords: long context
Abstract: Diffusion-based approaches to long-form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long-document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long-range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state-of-the-art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long-form applications such as scientific writing, large-scale code generation, and multi-turn long-context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long-text generation.

Title: TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Authors: Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20757
Pdf URL: https://arxiv.org/pdf/2512.20757
Copy Paste: [[2512.20757]] TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior(https://arxiv.org/abs/2512.20757)
Keywords: language model
Abstract: Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.

Title: Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization

Authors: Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.20773
Pdf URL: https://arxiv.org/pdf/2512.20773
Copy Paste: [[2512.20773]] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization(https://arxiv.org/abs/2512.20773)
Keywords: chat
Abstract: Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

Title: Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

Authors: Ramatu Oiza Abdulsalam, Segun Aroyehun
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.20780
Pdf URL: https://arxiv.org/pdf/2512.20780
Copy Paste: [[2512.20780]] Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles(https://arxiv.org/abs/2512.20780)
Keywords: language model, agent
Abstract: Recent work has explored the use of large language models for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple large language models respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that large language models approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, large language models tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent large language models exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies. These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

Title: Investigating Model Editing for Unlearning in Large Language Models

Authors: Shariqah Hossain, Lalana Kagal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.20794
Pdf URL: https://arxiv.org/pdf/2512.20794
Copy Paste: [[2512.20794]] Investigating Model Editing for Unlearning in Large Language Models(https://arxiv.org/abs/2512.20794)
Keywords: language model, llm
Abstract: Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.

Title: Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Authors: Zhengyang Shan, Aaron Mueller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.20796
Pdf URL: https://arxiv.org/pdf/2512.20796
Copy Paste: [[2512.20796]] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?(https://arxiv.org/abs/2512.20796)
Keywords: language model
Abstract: We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.

Title: Semantic Deception: When Reasoning Models Can't Compute an Addition

Authors: Nathaniël de Leeuw, Marceau Nahon, Mathis Reymond, Raja Chatila, Mehdi Khamassi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.20812
Pdf URL: https://arxiv.org/pdf/2512.20812
Copy Paste: [[2512.20812]] Semantic Deception: When Reasoning Models Can't Compute an Addition(https://arxiv.org/abs/2512.20812)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objectives are (1) to assess LLMs' capacity for abstraction and manipulation of arbitrary symbol systems; and (2) to evaluate their ability to resist misleading semantic cues that conflict with the task's symbolic logic. Through experiments with four LLMs, we show that semantic cues can significantly degrade reasoning models' performance on very simple tasks. These results reveal limitations in current LLMs' ability to perform symbolic manipulation and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thought reasoning may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities. These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model's training.
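The idea of redefining digits and operators with misleading symbols can be made concrete with a toy generator. The symbol table, the choice of `*` as the addition sign, and the helper names below are hypothetical and are not taken from the paper; the sketch only shows how a surface form can suggest one calculation while the notation defines another.

```python
# Hypothetical symbol table: each standard digit is rewritten as a different
# digit, so the surface form suggests the wrong value (a deceptive cue).
DIGIT_TO_SYMBOL = dict(zip("0123456789", "5738209164"))
SYMBOL_TO_DIGIT = {s: d for d, s in DIGIT_TO_SYMBOL.items()}
PLUS_SYMBOL = "*"  # hypothetical: this symbol is redefined to mean addition


def encode(n):
    """Write a non-negative integer in the altered notation."""
    return "".join(DIGIT_TO_SYMBOL[d] for d in str(n))


def decode(text):
    """Read a string in the altered notation back into an integer."""
    return int("".join(SYMBOL_TO_DIGIT[s] for s in text))


def make_problem(a, b):
    """Return (prompt, expected answer), both in the altered notation."""
    prompt = f"{encode(a)} {PLUS_SYMBOL} {encode(b)}"
    return prompt, encode(a + b)


prompt, answer = make_problem(12, 30)
print(prompt, "->", answer)
```

Here the prompt reads like "73 times 85", but under the redefined notation it means 12 + 30, so the expected answer is the encoding of 42.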

Title: EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading

Authors: Kumar Satvik Chaudhary, Chengshuai Zhao, Fan Zhang, Yung Hin Tse, Garima Agrawal, Yuli Deng, Huan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.20817
Pdf URL: https://arxiv.org/pdf/2512.20817
Copy Paste: [[2512.20817]] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading(https://arxiv.org/abs/2512.20817)
Keywords: language model
Abstract: Understanding how automated grading systems evaluate essays remains a significant challenge for educators and students, especially when large language models function as black boxes. We introduce EssayCBM, a rubric-aligned framework that prioritizes interpretability in essay assessment. Instead of predicting grades directly from text, EssayCBM evaluates eight writing concepts, such as Thesis Clarity and Evidence Use, through dedicated prediction heads on an encoder. These concept scores form a transparent bottleneck, and a lightweight network computes the final grade using only concepts. Instructors can adjust concept predictions and instantly view the updated grade, enabling accountable human-in-the-loop evaluation. EssayCBM matches black-box performance while offering actionable, concept-level feedback through an intuitive web interface.
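A concept bottleneck of the kind described can be sketched as a grade computed only from concept scores, with an instructor override applied before the final step. Only "Thesis Clarity" and "Evidence Use" are named in the abstract; the six placeholder concept names, the uniform linear weights, and the function `final_grade` are assumptions for illustration, not the paper's architecture.

```python
# Two concept names come from the abstract; the other six are placeholders.
CONCEPTS = ["Thesis Clarity", "Evidence Use"] + [f"concept_{i}" for i in range(3, 9)]

# Hypothetical grading head: a uniform linear combination of concept scores.
WEIGHTS = {name: 1 / len(CONCEPTS) for name in CONCEPTS}


def final_grade(concept_scores, overrides=None):
    """Compute the grade from concept scores only, after instructor overrides."""
    scores = dict(concept_scores)
    scores.update(overrides or {})
    return sum(WEIGHTS[name] * scores[name] for name in CONCEPTS)


predicted = {name: 0.5 for name in CONCEPTS}  # stand-in for encoder head outputs
base = final_grade(predicted)
adjusted = final_grade(predicted, {"Thesis Clarity": 1.0})
```

Because the grade depends on the concept scores alone, changing one score changes the grade in a way that can be traced back to that concept.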

Title: MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Authors: Zhan Qu, Michael Färber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20822
Pdf URL: https://arxiv.org/pdf/2512.20822
Copy Paste: [[2512.20822]] MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs(https://arxiv.org/abs/2512.20822)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.

Title: Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Faisal Ladhak, Fay Wang, Fei Jia, Felipe Soares, Feng Chen, Ferenc Galko, Frankie Siino, Gal Hubara Agam, Ganesh Ajjanagadde, Gantavya Bhatt
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20848
Pdf URL: https://arxiv.org/pdf/2512.20848
Copy Paste: [[2512.20848]] Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning(https://arxiv.org/abs/2512.20848)
Keywords: language model, gpt, chat, agent
Abstract: We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.

Title: How important is Recall for Measuring Retrieval Quality?

Authors: Shelly Schwartz, Oleg Vasilyev, Randy Sawaya
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.20854
Pdf URL: https://arxiv.org/pdf/2512.20854
Copy Paste: [[2512.20854]] How important is Recall for Measuring Retrieval Quality?(https://arxiv.org/abs/2512.20854)
Keywords: llm
Abstract: In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
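The limitation the abstract starts from can be seen by comparing two standard metrics. The sketch below uses textbook definitions of precision@k and recall@k with made-up document IDs; it does not reproduce the retrieval quality measure the paper introduces, which the abstract does not specify.

```python
def precision_at_k(retrieved, relevant_found, k):
    """Fraction of the top-k results judged relevant; needs no corpus-wide count."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant_found) / k


def recall_at_k(retrieved, all_relevant, k):
    """Fraction of all relevant documents found in the top k.

    Requires the full relevant set, which is typically unknown for a large,
    evolving knowledge base.
    """
    top = retrieved[:k]
    return sum(1 for doc in top if doc in all_relevant) / len(all_relevant)


retrieved = ["d1", "d7", "d3", "d9"]   # hypothetical ranked results
all_relevant = {"d1", "d3", "d5"}      # only known in a labeled benchmark

p = precision_at_k(retrieved, all_relevant, 4)
r = recall_at_k(retrieved, all_relevant, 4)
```

Precision needs relevance judgments only for the retrieved documents, whereas recall divides by the size of the full relevant set.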

Title: NVIDIA Nemotron 3: Efficient and Open Intelligence

Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Anjulie Agrusa, Ankur Verma, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Cyril Meurillon, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Evgeny Tsykunov, Faisal Ladhak, Fay Wang, Fei Jia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20856
Pdf URL: https://arxiv.org/pdf/2512.20856
Copy Paste: [[2512.20856]] NVIDIA Nemotron 3: Efficient and Open Intelligence(https://arxiv.org/abs/2512.20856)
Keywords: agent
Abstract: We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, which enables reasoning and multi-step tool use, and they support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

Title: Architectural Trade-offs in Small Language Models Under Compute Constraints

Authors: Shivraj Singh Bhatti
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20877
Pdf URL: https://arxiv.org/pdf/2512.20877
Copy Paste: [[2512.20877]] Architectural Trade-offs in Small Language Models Under Compute Constraints(https://arxiv.org/abs/2512.20877)
Keywords: language model
Abstract: We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
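The comparison axes named in the abstract (test NLL and approximate training FLOPs) can be computed as follows. The abstract does not say how FLOPs were approximated, so the widely used rule of thumb of about 6 FLOPs per parameter per training token is an assumption here, as are the function names and toy inputs.

```python
import math


def mean_nll(token_probs):
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per training token.

    This common approximation is an assumption; the paper may count differently.
    """
    return 6 * n_params * n_tokens


# Hypothetical probabilities a model assigned to two held-out tokens.
nll = mean_nll([0.5, 0.25])
perplexity = math.exp(nll)
flops = approx_training_flops(1_000_000, 2_000_000)
```

Plotting NLL against estimated FLOPs for each architecture is one way to read off accuracy-efficiency trade-offs of the kind the study describes.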

Title: Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation

Authors: Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen, Jun Zhang, Jieping Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.20908
Pdf URL: https://arxiv.org/pdf/2512.20908
Copy Paste: [[2512.20908]] Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation(https://arxiv.org/abs/2512.20908)
Keywords: llm
Abstract: Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models. To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the observed performance of the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
摘要：推理蒸馏越来越受到人们的关注。它通常利用大型教师模型来生成推理路径，然后用于微调学生模型，使其模仿教师在训练环境中的行为。然而，以前的方法缺乏对蒸馏模型功能的起源的详细分析。目前尚不清楚学生是否能够在新的测试环境中与老师保持一致的行为，或者是否回归到原来的输出模式，这引起了人们对蒸馏模型泛化的担忧。为了分析这个问题，我们引入了一个跨模型推理蒸馏溯源框架。对于蒸馏模型产生的每个动作（例如，一个句子），我们获得由教师、原始学生和相同上下文下的蒸馏模型分配的预测概率。通过比较这些概率，我们将每个动作分为不同的类别。通过系统地理清每个动作的来源，我们通过实验证明，在测试时背景下，蒸馏模型确实可以生成教师发起的动作，这些动作与蒸馏模型上观察到的表现相关并合理地解释。在此分析的基础上，我们进一步提出了一种教师指导的数据选择方法。与依赖启发式的先前方法不同，我们的方法直接比较训练数据上的师生分歧，提供原则性的选择标准。我们在多个代表性教师模型和不同学生模型中验证了我们方法的有效性。结果凸显了我们的溯源框架的实用性，并强调了它对推理精炼的承诺。我们希望与社区分享推理蒸馏溯源以及我们对推理蒸馏的见解。
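
The idea of comparing the probabilities that different models assign to the same action can be sketched as follows. The abstract does not state the actual classification rule or category names, so the margin threshold, the labels, and the function names here are all assumptions made for illustration. The distilled model's own probability is left out to keep the sketch short.

```python
import math
from typing import List


def sequence_logprob(token_probs: List[float]) -> float:
    """Log-probability of an action (e.g. a sentence) from its per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def classify_action(lp_teacher: float, lp_student: float, margin: float = 1.0) -> str:
    """Hypothetical rule: label an action by which reference model favors it.

    `margin` (in nats) is an assumed threshold, not a value from the paper.
    """
    diff = lp_teacher - lp_student
    if diff > margin:
        return "teacher-favored"
    if diff < -margin:
        return "student-favored"
    return "shared"


# Hypothetical per-token probabilities for one sentence under the same context.
teacher_lp = sequence_logprob([0.9, 0.8])
student_lp = sequence_logprob([0.2, 0.1])
label = classify_action(teacher_lp, student_lp)
```

Under these assumed numbers the teacher assigns the sentence a much higher probability than the original student, so the sketch labels it as teacher-favored.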

Title: Neural Probe-Based Hallucination Detection for Large Language Models

Authors: Shize Liang, Hongzhi Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20949
Pdf URL: https://arxiv.org/pdf/2512.20949
Copy Paste: [[2512.20949]] Neural Probe-Based Hallucination Detection for Large Language Models(https://arxiv.org/abs/2512.20949)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model's hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic [...]. To overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguity. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training [...]. [...] results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.
摘要：大型语言模型（LLM）擅长文本生成和知识问答任务，但它们容易生成幻觉内容，严重限制了它们在高风险领域的应用。当前基于不确定性估计和外部知识检索的幻觉检测方法存在局限性，即它们仍然会在高置信度水平下产生错误内容，并且严重依赖检索效率和知识覆盖率。相比之下，利用模型隐藏层状态的探测方法提供了实时和轻量级的优势。然而，传统的线性探针难以捕获深层语义 [...] 中的非线性结构。为克服这些限制，我们提出了一种基于神经网络的词元级幻觉检测框架。通过冻结语言模型参数，我们采用轻量级 MLP 探针来执行高级隐藏状态的非线性建模。设计多目标联合损失函数来增强检测稳定性和语义消歧性。此外，我们建立了层位置探针性能响应模型，使用贝叶斯优化自动搜索最佳探针插入层并实现更优的训练 [...]。在 LongFact、HealthBench 和 TriviaQA 上的 [...] 结果表明，MLP 探针在低误报条件下的准确性、召回率和检测能力方面显著优于最先进的方法。

Title: MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment

Authors: Mohammad Mahdi Abootorabi, Alireza Ghahramani Kure, Mohammadali Mohammadkhani, Sina Elahimanesh, Mohammad Ali Ali Panah
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20950
Pdf URL: https://arxiv.org/pdf/2512.20950
Copy Paste: [[2512.20950]] MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment(https://arxiv.org/abs/2512.20950)
Keywords: language model
Abstract: This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
摘要：本文介绍了我们用于 SemEval-2025 任务 7 的系统：多语言和跨语言事实检查声明检索。在错误信息迅速传播的时代，有效的事实核查变得越来越重要。我们介绍 TriAligner，这是一种新颖的方法，它利用双编码器架构和对比学习，并结合不同模式的母语和英语翻译。我们的方法通过学习不同来源的相对重要性来有效地检索跨多种语言的声明。为了增强鲁棒性，我们使用大型语言模型进行高效的数据预处理和增强，同时结合硬负采样来改进表示学习。我们在单语言和跨语言基准上评估了我们的方法，证明检索准确性和事实检查性能较基线有显着提高。

Title: Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

Authors: Xiang Zhang, Jiaqi Wei, Yuejin Yang, Zijie Qiu, Yuhan Chen, Zhiqiang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Wanli Ouyang, Chenyu You, Siqi Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20954
Pdf URL: https://arxiv.org/pdf/2512.20954
Copy Paste: [[2512.20954]] Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models(https://arxiv.org/abs/2512.20954)
Keywords: language model, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.
摘要：思想链 (CoT) 提示在使用大型语言模型的自然语言处理中具有显着先进的任务解决能力。与标准提示不同，CoT 鼓励模型生成中间推理步骤、非答案标记，这有助于引导模型获得更准确的最终输出。这些中间步骤可以实现更复杂的推理过程，例如纠错、内存管理、未来规划和自我反思。然而，将 CoT 应用于非自然语言领域（例如蛋白质和 RNA 语言模型）尚不可能，主要是由于其标记空间（例如氨基酸标记）的表达能力有限。在这项工作中，我们提出并定义了语言表达能力的概念：给定语言使用其标记和语法来编码信息的能力。我们证明蛋白质语言的有限表达能力严重限制了 CoT 式推理的适用性。为了克服这个问题，我们首次在生物序列模型中引入反射预训练，这使得模型能够通过生成超越简单答案标记的辅助“思考标记”来进行中间推理。从理论上讲，我们证明了我们的增强标记集显着增强了生物语言表达能力，从而提高了模型的整体推理能力。在实验上，我们的预训练方法教会蛋白质模型进行自我纠正，并与标准预训练相比带来显着的性能提升。

Title: Automatic Replication of LLM Mistakes in Medical Conversations

Authors: Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.20983
Pdf URL: https://arxiv.org/pdf/2512.20983
Copy Paste: [[2512.20983]] Automatic Replication of LLM Mistakes in Medical Conversations(https://arxiv.org/abs/2512.20983)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at this https URL.
摘要：大语言模型 (LLM) 在临床环境中越来越多地使用量化推理质量、安全性和以患者为中心程度的多维评分标准进行评估。然而，在其他 LLM 中复现特定错误并不简单，而且通常需要人工操作。我们引入了 MedMistake，这是一个自动流水线，可以提取 LLM 在患者与医生对话中犯下的错误，并将其转换为单次 QA 对的基准。我们的流程 (1) 在 LLM 患者和 LLM 医生之间创建复杂的对话数据，(2) 由 2 个 LLM 评审组成的委员会进行跨多个维度的评估，(3) 根据这些错误创建简化的单次 QA 场景。我们发布了 MedMistake-All，这是一个包含 3,390 个单次 QA 对的数据集，根据两个 LLM 评审的判断，GPT-5 和 Gemini 2.5 Pro 目前未能正确回答这些问题。我们聘请医学专家验证了 211/3390 个问题的子集 (MedMistake-Bench)，并用其对 12 个前沿 LLM 进行了最终评估：Claude Opus 4.5、Claude Sonnet 4.5、DeepSeek-Chat、Gemini 2.5 Pro、Gemini 3 Pro、GPT-4o、GPT-5、GPT-5.1、GPT-5.2、Grok 4、Grok 4.1、Mistral Large。我们发现 GPT 模型、Claude 和 Grok 在 MedMistake-Bench 上获得了最佳性能。我们在此 https URL 发布了经过医生验证的基准 (MedMistake-Bench) 以及完整数据集 (MedMistake-All)。

Title: Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Authors: Wei-Rui Chen, Vignesh Kothapalli, Ata Fatahibaarzi, Hejian Sang, Shao Tang, Qingquan Song, Zhipeng Wang, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21002
Pdf URL: https://arxiv.org/pdf/2512.21002
Copy Paste: [[2512.21002]] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation(https://arxiv.org/abs/2512.21002)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first 50% of tokens of every training sequence can retain, on average, approximately 94% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens, which provides a simple lever for computation-quality tradeoffs. Codes are available at this https URL.
摘要：将大型语言模型 (LLM) 的推理能力蒸馏到较小的学生模型通常需要对大量推理数据进行训练。然而，对带有提示 (P)、思维链 (CoT) 和答案 (A) 片段的冗长序列进行蒸馏使得该过程的计算成本很高。在这项工作中，我们研究了不同片段（P、CoT、A）之间的监督分配如何影响学生模型的表现。我们的分析表明，当提示和答案信息包含在 CoT 中时，仅对 CoT 词元进行选择性知识蒸馏可能是有效的。基于这一见解，我们建立了一个截断协议，以量化计算与质量的权衡随序列长度的变化。我们观察到，仅对每个训练序列的前 50% 词元进行训练，平均可以在数学基准上保留约 94% 的全序列性能，同时将训练时间、内存使用量和 FLOPs 各减少约 50%。这些发现表明，推理蒸馏受益于优先考虑早期推理词元，这为计算与质量的权衡提供了一个简单的杠杆。代码可从此 https URL 获取。
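
The truncation idea described in this abstract (training on only the leading portion of each sequence) can be sketched in a few lines. This is a minimal illustration under assumptions: the function name, the rounding rule, and the at-least-one-token floor are hypothetical and are not the authors' protocol.

```python
from typing import List


def truncate_for_distillation(token_ids: List[int], keep_fraction: float = 0.5) -> List[int]:
    """Keep only the leading fraction of a training sequence's tokens."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Assumed rounding rule: floor, but keep at least one token for non-empty input.
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]


# A hypothetical 10-token training sequence (prompt + CoT + answer).
sequence = list(range(10))
truncated = truncate_for_distillation(sequence, keep_fraction=0.5)
```

Because the sequence is cut from the end, the answer segment and late reasoning tokens are the first to be dropped, which matches the abstract's emphasis on early reasoning tokens.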

Title: Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

Authors: Xiaofeng Shi, Qian Kou, Yuduo Li, Hua Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21017
Pdf URL: https://arxiv.org/pdf/2512.21017
Copy Paste: [[2512.21017]] Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy(https://arxiv.org/abs/2512.21017)
Keywords: language model, llm, chain-of-thought
Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion (the final answer), whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
摘要：随着大型语言模型（LLM）的快速发展，思想链（CoT）组件对于复杂的推理任务变得非常重要。然而，在传统的监督微调 (SFT) 中，模型可能会过多地关注长度过长的 CoT 序列。这减少了对较短但重要的关键部分（最终答案）的关注，其正确性直接决定任务的成功和评估质量。为了解决这个限制，我们提出了 SFTKey，一种两阶段训练方案。在第一阶段，应用传统的SFT来确保正确的输出格式，而在第二阶段，仅对关键部分进行微调以提高准确性。跨多个基准和模型系列的大量实验表明，与传统 SFT 相比，SFTKey 的平均准确度提高了超过 5%，同时保留了生成正确格式的能力。总体而言，这项研究通过明确平衡 CoT 学习与答案相关标记的额外优化来推进 LLM 微调。
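
One common way to restrict fine-tuning to a subset of tokens is a loss mask, and the two stages described in this abstract can be illustrated that way. The sketch below is an assumption: the abstract does not say how the Key portion is isolated or how the loss is computed, and the function names and token log-probabilities are hypothetical.

```python
from typing import List


def masked_nll(token_logprobs: List[float], loss_mask: List[int]) -> float:
    """Mean negative log-likelihood over the tokens selected by the mask."""
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("log-probabilities and mask must have the same length")
    selected = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not selected:
        raise ValueError("mask selects no tokens")
    return sum(selected) / len(selected)


def key_only_mask(num_cot_tokens: int, num_key_tokens: int) -> List[int]:
    """Mask that ignores CoT tokens and keeps only the final-answer (Key) tokens."""
    return [0] * num_cot_tokens + [1] * num_key_tokens


# Hypothetical log-probabilities: three CoT tokens followed by one answer token.
logprobs = [-0.5, -0.5, -0.5, -2.0]

stage1_loss = masked_nll(logprobs, [1, 1, 1, 1])         # conventional SFT over all tokens
stage2_loss = masked_nll(logprobs, key_only_mask(3, 1))  # Key portion only
```

In this toy case the poorly predicted answer token contributes only a quarter of the stage-1 loss but all of the stage-2 loss, which is the imbalance the abstract describes.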

Title: Semantic Refinement with LLMs for Graph Representations

Authors: Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye, Chuxu Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.21106
Pdf URL: https://arxiv.org/pdf/2512.21106
Copy Paste: [[2512.21106]] Semantic Refinement with LLMs for Graph Representations(https://arxiv.org/abs/2512.21106)
Keywords: language model, llm
Abstract: Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework DAS for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the semantic refinement of LLM, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.
摘要：图结构数据在其预测信号的来源方面表现出很大的异质性：在某些领域，节点级语义占主导地位，而在其他领域，结构模式发挥着核心作用。这种结构语义异质性意味着没有一个具有固定归纳偏差的图学习模型可以在不同的图域中进行最佳泛化。然而，大多数现有方法通过逐步注入新的归纳偏差来从模型方面解决这一挑战，考虑到现实世界图的开放多样性，这种偏差从根本上仍然受到限制。在这项工作中，我们采取以数据为中心的视角，并将节点语义视为任务自适应变量。我们提出了一种用于图表示学习的数据自适应语义细化框架 DAS，它将固定图神经网络（GNN）和大型语言模型（LLM）耦合在闭合反馈环中。 GNN提供隐式监督信号来指导LLM的语义细化，并且细化的语义被反馈以更新相同的图学习器。我们在富含文本和无文本的图表上评估我们的方法。结果显示，结构主导的图得到了一致的改进，同时在语义丰富的图上保持了竞争力，证明了结构语义异质性下以数据为中心的语义适应的有效性。

Title: Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Authors: Eduard Stefan Dinuta, Iustin Sirbu, Traian Rebedea
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.21107
Pdf URL: https://arxiv.org/pdf/2512.21107
Copy Paste: [[2512.21107]] Semi-Supervised Learning for Large Language Models Safety and Content Moderation(https://arxiv.org/abs/2512.21107)
Keywords: language model, llm, prompt
Abstract: Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.
摘要：大型语言模型 (LLM) 的安全性自出现以来一直是一个持续的研究焦点，并且随着这些模型容量的不断增加，其重要性在当今更加重要。目前，所有公共法学硕士都设有几个护栏，并且有多个用于训练安全分类器的拟议数据集。然而，训练这些安全分类器依赖于大量标记数据，这些数据获取起来可能存在问题，容易出现标记错误，或者通常包含合成数据。为了解决这些问题，我们建议采用不同的方法：利用半监督学习技术，利用标记和未标记的数据来提高安全任务的性能。我们分析了这些技术可以为大型语言模型的提示和对这些请求的响应提供的改进。此外，由于增强是半监督算法的核心部分，因此我们证明了使用特定于任务的增强的重要性，与通用增强技术相比，它显着提高了性能。

Title: ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

Authors: Sichun Luo, Yi Huang, Mukai Li, Shichang Meng, Fengyuan Liu, Zefa Hu, Junlan Feng, Qi Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.21120
Pdf URL: https://arxiv.org/pdf/2512.21120
Copy Paste: [[2512.21120]] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models(https://arxiv.org/abs/2512.21120)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.
摘要：大型语言模型 (LLM) 越来越多地被部署为开放域、多轮环境中的对话助手，其中用户经常提供不完整或模糊的信息。然而，现有的面向 LLM 的澄清基准主要假设单轮交互或合作型用户，限制了它们在现实环境中评估澄清行为的能力。我们引入了 ClarifyMT-Bench，这是一个基于五维歧义分类法和一组六个行为多样化的模拟用户角色的多轮澄清基准。通过 LLM 与人类的混合流水线，我们构建了 6,120 个多轮对话，捕获了不同的歧义来源和交互模式。对十个具有代表性的 LLM 的评估发现了一贯的澄清不足偏差：LLM 往往过早回答，并且随着对话深度的增加，表现会下降。为了缓解这个问题，我们提出了 ClarifyAgent，这是一种智能体方法，将澄清分解为感知、预测、跟踪和规划，从而大大提高了在各种歧义条件下的鲁棒性。ClarifyMT-Bench 为研究 LLM 何时应提问、何时应回答以及如何处理现实世界中人与 LLM 交互中的歧义奠定了可复现的基础。

Title: SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

Authors: Mahi Luthra, Jiayi Shen, Maxime Poli, Angelo Ortiz, Yosuke Higuchi, Youssef Benchekroun, Martin Gleize, Charles-Eric Saint-James, Dongyan Lin, Phillip Rust, Angel Villar, Surya Parimi, Vanessa Stark, Rashel Moritz, Juan Pino, Yann LeCun, Emmanuel Dupoux
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21204
Pdf URL: https://arxiv.org/pdf/2512.21204
Copy Paste: [[2512.21204]] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation(https://arxiv.org/abs/2512.21204)
Keywords: language model
Abstract: Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over 100x more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at this https URL.
摘要：人类婴儿只需要几百小时的言语接触，就能习得新语言的基本单位，这凸显了与需要大量数据的自我监督言语模型相比，惊人的效率差距。为了解决这一差距，本文引入了 SpidR-Adapt，使用最少的未标记数据快速适应新语言。我们将这种低资源语音表示学习视为元学习问题，并构建了一个多任务自适应预训练（MAdaPT）协议，该协议将自适应过程制定为双层优化框架。为了在这个框架下实现可扩展的元训练，我们提出了一种新颖的启发式解决方案，即一阶双层优化（FOBLO），避免了大量的计算成本。最后，我们通过交替监督和监督目标的交错监督使用稳健的初始化来稳定元训练。根据经验，SpidR-Adapt 在音素辨别力 (ABX) 和口语建模（sWUGGY、sBLIMP、tSC）方面取得了快速进步，在不到 1 小时的目标语言音频训练后比域内语言模型有所改进，数据效率比标准训练高出 100 倍以上。这些发现强调了一条实用的、与架构无关的、通向受生物学启发、数据高效表示的道路。我们在此 https URL 开源训练代码和模型检查点。

Title: SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Authors: Divij Dudeja, Mayukha Pal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21280
Pdf URL: https://arxiv.org/pdf/2512.21280
Copy Paste: [[2512.21280]] SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance(https://arxiv.org/abs/2512.21280)
Keywords: language model, gpt, hallucination
Abstract: Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically and is based on three main components: (1) a syntax-aware Fact Extractor (Grammarian), a Tree LSTM that extracts facts as subject-relation-object relations from EM sentences; (2) a compact indexed memory MANN (Memory-Augmented Neural Network) that indexes these Rational Subject Relation Objects as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAGs for new uploads (FAISS Top 20 results with memory severed at 64 slots). In real-world deployment, this framework leads to better-supported results with fewer hallucinations than comparable small transformer models.
摘要：工程手册 (EM) 的用户发现阅读 EM 很困难，因为它们很长，格式很密集，其中包括书面文档、分步程序和工程设备的标准参数列表。现成的变压器，尤其是紧凑型变压器，将这种材料视为扁平的令牌流。这种方法会导致自信但不正确的数字答案，并迫使模型低效地记住单独的事实。 SMART（结构化记忆和推理变压器）为上述问题提供了一种不同且实用的解决方案。 SMART 使用分层方法构建其处理，并基于三个主要工作类别 (1) 语法感知事实提取器（语法）树 LSTM，从 EM 句子中提取事实作为主题关系对象关系 (2) 紧凑索引内存 MANN（内存增强神经网络），将这些理性主题关系对象索引为与信息源关联的 384 维向量，以及 (3) 学习的 6 层 Transformer将先前检索到的事实融合到其生成的响应中。整个SMART模型使用了45.51M参数，比GPT-2（124M）少了64%，比BERT（133M）少了69%，并且取得了比GPT-2高21.3%的准确率，表明SMART以最少的处理要求更好地拟合了数据。 SMART 采用双推理模式：已知文档的索引快速路径（亚秒级应答时间）和由 RAG 辅助的用于新上传的索引动态路径（FAISS Top 20 结果，内存在 64 个插槽处切断）。在现实世界的部署中，与类似的小型变压器模型相比，该框架可以提供更好的支持结果，并减少幻觉。

Title: Parallel Token Prediction for Language Models

Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.21323
Pdf URL: https://arxiv.org/pdf/2512.21323
Copy Paste: [[2512.21323]] Parallel Token Prediction for Language Models(https://arxiv.org/abs/2512.21323)
Keywords: language model
Abstract: We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
摘要：我们提出了并行令牌预测（PTP），这是一种用于语言模型中并行序列生成的通用框架。 PTP 通过将采样过程合并到模型中，在单个转换器调用中联合预测多个相关令牌。这减少了自回归解码的延迟瓶颈，并避免了现有多令牌预测方法中常见的限制性独立假设。我们证明 PTP 可以表示任意自回归序列分布。 PTP 可以通过提炼现有模型或在没有老师的情况下通过逆自回归训练来进行训练。实验上，我们通过在 Spec-Bench 上每步接受超过 4 个令牌，在 Vicuna-7B 上实现了最先进的推测解码性能。我们框架的通用性表明，在不损失建模能力的情况下，并行生成长序列是可行的。

Title: Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

Authors: Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21329
Pdf URL: https://arxiv.org/pdf/2512.21329
Copy Paste: [[2512.21329]] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks(https://arxiv.org/abs/2512.21329)
Keywords: language model
Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
摘要：抽象与推理语料库 (ARC) 和 ARC-AGI 等推理基准被广泛用于评估人工智能的进展，并且通常被解释为对核心、所谓“流体”推理能力的探索。尽管这些任务对人类来说显然很简单，但对于前沿视觉语言模型（VLM）来说仍然具有挑战性，这种差距通常归因于机器推理的缺陷。我们对这种解释提出质疑，并假设这种差距主要源于视觉感知的局限性，而不是归纳推理的缺陷。为了验证这个假设，我们引入了一个两阶段的实验流程，明确区分感知和推理。在感知阶段，每个图像被独立地转换为自然语言描述，而在推理阶段，模型使用这些描述归纳并应用规则。这种设计可以防止跨图像感应信号的泄漏，并将推理与感知瓶颈隔离开来。在 Mini-ARC、ACRE 和 Bongard-LOGO 这三个 ARC 风格的数据集上，我们通过将两阶段管道与标准的端到端一阶段评估进行比较，表明感知能力是观察到的性能差距的主导因素。对 VLM 输出中推理轨迹的手动检查进一步表明，大约 80% 的模型失败源于感知错误。总之，这些结果表明 ARC 风格的基准测试将感知和推理挑战混为一谈，并且观察到的性能差距可能夸大了机器推理的缺陷。我们的研究结果强调了在评估机器智能进展时需要将感知与推理分开的评估协议。

Title: C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Authors: Jin Qin, Zihan Liao, Ziyin Zhang, Hang Yu, Peng Di, Rui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21332
Pdf URL: https://arxiv.org/pdf/2512.21332
Copy Paste: [[2512.21332]] C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling(https://arxiv.org/abs/2512.21332)
Keywords: language model, llm
Abstract: We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embedding from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
摘要：我们提出了 C2LLM - 对比代码大型语言模型，这是一系列 0.5B 和 7B 大小的代码嵌入模型。基于 Qwen-2.5-Coder 主干，C2LLM 采用多头注意力池（PMA）模块从 token 嵌入生成序列嵌入，有效地 1）利用预训练期间获得的 LLM 因果表示，同时 2）能够聚合序列中所有 token 的信息，打破基于 EOS 的序列嵌入的信息瓶颈，3）支持嵌入维度的灵活调整，作为 MRL 的替代品。经过 300 万个公开数据的训练，C2LLM 模型在类似规模的模型中创下了 MTEB-Code 的新记录，其中 C2LLM-7B 在整体排行榜上排名第一。