2026-02-11

Title: Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide

Authors: Hossam Amer, Rezaul Karim, Ali Pourranjbar, Weiwei Zhang, Walid Ahmed, Boxing Chen
Subjects: cs.LG, cs.AI, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2602.09109
Pdf URL: https://arxiv.org/pdf/2602.09109
Copy Paste: [[2602.09109]] Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide(https://arxiv.org/abs/2602.09109)
Keywords: generation
Abstract: With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs, and of how such insights can inform a principled methodology for designing optimal distributed systems, remains limited. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights to guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.
摘要：随着大型语言模型 (LLM) 的快速增长，人们开发了多种方法来跨硬件设备分配计算和内存，以实现高效的训练和推理。虽然现有的调查提供了这些技术的描述性概述，但对其好处和权衡的系统分析以及这些见解如何为设计最佳分布式系统的原则方法提供信息仍然有限。本文对集体操作和分布式并行策略进行了全面回顾，并辅以数学公式来加深理论理解。我们进一步研究混合并行化设计，强调模型部署不同阶段（包括训练和推理）之间的通信计算重叠。还讨论了使用成本模型自动搜索最佳混合并行化策略的最新进展。此外，我们还提供了主流架构类别的案例研究，以揭示实证见解，以指导研究人员和从业者选择并行策略。最后，我们强调了当前法学硕士培训范式的开放挑战和局限性，并概述了下一代大规模模型开发的有希望的方向。

Title: Epistemic Throughput: Fundamental Limits of Attention-Constrained Inference

Authors: Lei You
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2602.09127
Pdf URL: https://arxiv.org/pdf/2602.09127
Copy Paste: [[2602.09127]] Epistemic Throughput: Fundamental Limits of Attention-Constrained Inference(https://arxiv.org/abs/2602.09127)
Keywords: generative
Abstract: Recent generative and tool-using AI systems can surface a large volume of candidates at low marginal cost, yet only a small fraction can be checked carefully. This creates a decoder-side bottleneck: downstream decision-makers must form reliable posteriors from many public records under scarce attention. We formalize this regime via Attention-Constrained Inference (ACI), in which a cheap screening stage processes $K$ records and an expensive verification stage can follow up on at most $B$ of them. Under Bayes log-loss, we study the maximum achievable reduction in posterior uncertainty per window, which we call epistemic throughput. Our main result is a "JaKoB" scaling law showing that epistemic throughput has a baseline term that grows linearly with verification and prevalence, and an additional information-leverage term that scales as $\sqrt{JKB}$, where $J$ summarizes screening quality. Thus, expanding cheap screening can nonlinearly amplify scarce verification, even when informative records are rare. We further show that this scaling is tight in a weak-screening limit, and that in the sparse-verification regime ($B \ll K$), substantial leverage requires heavy-tailed score distributions; for light-tailed scores the amplification is only logarithmic.
摘要：最近的生成和使用工具的人工智能系统可以以较低的边际成本呈现大量候选人，但只有一小部分可以仔细检查。这造成了解码器端的瓶颈：下游决策者必须在缺乏关注的情况下从许多公共记录中形成可靠的后验。我们通过注意力约束推理（ACI）形式化了这个机制，其中廉价的筛选阶段处理 $K$ 记录，而昂贵的验证阶段最多可以跟进其中的 $B$ 记录。在贝叶斯对数损失下，我们研究每个窗口后验不确定性可实现的最大减少，我们称之为\emph{认知吞吐量}。我们的主要结果是“JaKoB”标度定律，表明认知吞吐量有一个随着验证和流行度线性增长的基线项，以及一个额外的 \emph{信息杠杆} 项，其缩放比例为 $\sqrt{JKB}$，其中 $J$ 总结了筛选质量。因此，扩大廉价筛查可以非线性地放大稀缺的验证，即使信息记录很少。我们进一步表明，这种扩展在弱筛选限制下是严格的，并且在稀疏验证机制（$B \ll K$）中，实质性杠杆需要重尾分数分布；对于轻尾分数，放大只是对数的。
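
Note on the scaling above: a small worked consequence of the $\sqrt{JKB}$ form, written only for the information-leverage term. The symbol $L(K,B)$ is introduced here for illustration and is not from the abstract; constants and the baseline term are left out because the abstract does not give them.

```latex
% Assumption: L(K, B) denotes the information-leverage term only.
\[
  L(K,B) \propto \sqrt{JKB}
  \quad\Longrightarrow\quad
  \frac{L(4K,\,B)}{L(K,\,B)} = \sqrt{4} = 2,
  \qquad
  \frac{L(K,\,4B)}{L(K,\,B)} = \sqrt{4} = 2 .
\]
```

Under this reading, quadrupling the cheap screening volume $K$ at fixed $B$ doubles the leverage term, the same factor obtained by quadrupling the expensive verification budget $B$ at fixed $K$.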

Title: Counterfactual Maps: What They Are and How to Find Them

Authors: Awa Khouna, Julien Ferry, Thibaut Vidal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09128
Pdf URL: https://arxiv.org/pdf/2602.09128
Copy Paste: [[2602.09128]] Counterfactual Maps: What They Are and How to Find Them(https://arxiv.org/abs/2602.09128)
Keywords: generation
Abstract: Counterfactual explanations are a central tool in interpretable machine learning, yet computing them exactly for complex models remains challenging. For tree ensembles, predictions are piecewise constant over a large collection of axis-aligned hyperrectangles, implying that an optimal counterfactual for a point corresponds to its projection onto the nearest rectangle with an alternative label under a chosen metric. Existing methods largely overlook this geometric structure, relying either on heuristics with no optimality guarantees or on mixed-integer programming formulations that do not scale to interactive use. In this work, we revisit counterfactual generation through the lens of nearest-region search and introduce counterfactual maps, a global representation of recourse for tree ensembles. Leveraging the fact that any tree ensemble can be compressed into an equivalent partition of labeled hyperrectangles, we cast counterfactual search as the problem of identifying the generalized Voronoi cell associated with the nearest rectangle of an alternative label. This leads to an exact, amortized algorithm based on volumetric k-dimensional (KD) trees, which performs branch-and-bound nearest-region queries with explicit optimality certificates and sublinear average query time after a one-time preprocessing phase. Our experimental analyses on several real datasets drawn from high-stakes application domains show that this approach delivers globally optimal counterfactual explanations with millisecond-level latency, achieving query times that are orders of magnitude faster than existing exact, cold-start optimization methods.
摘要：反事实解释是可解释机器学习的核心工具，但为复杂模型精确计算它们仍然具有挑战性。对于树集成，预测在大量轴对齐的超矩形上是分段常数，这意味着一个点的最佳反事实对应于它在所选度量下具有替代标签的最近矩形的投影。现有方法在很大程度上忽视了这种几何结构，要么依赖于没有最优性保证的启发式方法，要么依赖于不能扩展到交互式使用的混合整数编程公式。在这项工作中，我们通过最近区域搜索的视角重新审视反事实生成，并引入反事实地图，这是树集成资源的全局表示。利用任何树集成都可以压缩为标记超矩形的等效分区这一事实，我们将反事实搜索视为识别与替代标签的最近矩形相关的广义 Voronoi 单元的问题。这导致了一种基于体积 k 维 (KD) 树的精确摊销算法，该算法在一次性预处理阶段后使用显式最优性证书和亚线性平均查询时间执行分支定界最近区域查询。我们对来自高风险应用程序领域的几个真实数据集进行的实验分析表明，这种方法可以以毫秒级延迟提供全局最优的反事实解释，实现的查询时间比现有的精确冷启动优化方法快几个数量级。
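
Note on the geometric view above: a minimal sketch of "projection onto the nearest rectangle with an alternative label", assuming the ensemble has already been compressed into labeled axis-aligned boxes and the metric is Euclidean. It uses a brute-force linear scan, not the KD-tree branch-and-bound search the abstract describes; the function names and the toy boxes are hypothetical.

```python
from math import inf, sqrt


def project_onto_box(x, lower, upper):
    """Clamp each coordinate into [lower_i, upper_i].

    For an axis-aligned box this clamp is the Euclidean projection.
    """
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def nearest_counterfactual(x, boxes, current_label):
    """Return (point, distance) for the closest box with a different label.

    boxes is a list of (lower, upper, label) triples. This is an O(n) scan,
    swapped in for the paper's KD-tree search to keep the sketch short.
    """
    best_point, best_dist = None, inf
    for lower, upper, label in boxes:
        if label == current_label:
            continue  # only regions with an alternative label qualify
        p = project_onto_box(x, lower, upper)
        d = sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))
        if d < best_dist:
            best_point, best_dist = p, d
    return best_point, best_dist


# Toy partition of part of the plane into three labeled boxes (hypothetical).
boxes = [
    ([0.0, 0.0], [1.0, 1.0], 0),
    ([1.0, 0.0], [2.0, 1.0], 1),
    ([0.0, 1.0], [1.0, 3.0], 1),
]
cf, dist = nearest_counterfactual([0.25, 0.5], boxes, current_label=0)
print(cf, dist)  # [0.25, 1.0] 0.5
```

The sketch treats boxes as closed sets, so the returned point can sit exactly on a shared boundary; a real tree ensemble assigns boundary points to one side of each split, which an exact method has to account for.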

Title: A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video

Authors: Andrea Filiberto Lucas, Dylan Seychell
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2602.09154
Pdf URL: https://arxiv.org/pdf/2602.09154
Copy Paste: [[2602.09154]] A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video(https://arxiv.org/abs/2602.09154)
Keywords: generative
Abstract: The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
摘要：基于视频的新闻内容数量不断增加，提高了对透明、可靠的方法来提取屏幕信息的需求。然而，图形布局、排版约定和特定于平台的设计模式的可变性使得手动索引变得不切实际。这项工作提出了一个综合框架，用于从广播和社交媒体原生新闻视频中自动检测和提取个人姓名。它引入了一个精心策划且平衡的注释框架语料库，捕捉当代新闻图形的多样性，并提出了一种可解释的、模块化的提取管道，旨在在确定性和可审计的条件下运行。该管道是根据对比类的生成多模态方法进行评估的，揭示了确定性可审计性和随机推理之间的明确权衡。底层检测器达到 95.8% mAP@0.5，展示了图形元素本地化的稳健运行性能。虽然生成系统的原始准确率略高（F1：84.18% vs 77.08%），但它们缺乏新闻和分析环境所需的透明数据沿袭。拟议的管道可提供平衡的精度 (79.9%) 和召回率 (74.4%)，避免幻觉，并在每个处理阶段提供完整的可追溯性。补充用户调查结果表明，59% 的受访者表示在快节奏的广播中难以阅读屏幕上的名字，这强调了该任务的实际相关性。结果为现代新闻媒体中的混合多模态信息提取建立了方法上严格且可解释的基线。
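
Note on the reported metrics above: a quick arithmetic check, assuming F1 is the usual harmonic mean of precision and recall. The function name is illustrative. The result lands close to the pipeline's reported 77.08%; the small gap is presumably due to rounding of the published precision and recall, which is an assumption, not something the abstract states.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the deterministic pipeline, in percent.
f1 = f1_score(79.9, 74.4)
print(round(f1, 2))  # 77.05
```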

Title: What do Geometric Hallucination Detection Metrics Actually Measure?

Authors: Eric Yeats, John Buckheit, Sarah Scullen, Brendan Kennedy, Loc Truong, Davis Brown, Bill Kay, Cliff Joslyn, Tegan Emerson, Michael J. Henry, John Emanuello, Henry Kvinge
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09158
Pdf URL: https://arxiv.org/pdf/2602.09158
Copy Paste: [[2602.09158]] What do Geometric Hallucination Detection Metrics Actually Measure?(https://arxiv.org/abs/2602.09158)
Keywords: generative
Abstract: Hallucination remains a barrier to deploying generative models in high-consequence applications. This is especially true in cases where external ground truth is not readily available to validate model outputs. This situation has motivated the study of geometric signals in the internal state of an LLM that are predictive of hallucination and require limited external knowledge. Given that a range of factors can lead model output to be called a hallucination (e.g., irrelevance vs. incoherence), in this paper we ask what specific properties of a hallucination these geometric statistics actually capture. To assess this, we generate a synthetic dataset that varies distinct properties of output associated with hallucination, including output correctness, confidence, relevance, coherence, and completeness. We find that different geometric statistics capture different types of hallucinations. Along the way we show that many existing geometric detection methods have substantial sensitivity to shifts in task domain (e.g., math questions vs. history questions). Motivated by this, we introduce a simple normalization method to mitigate the effect of domain shift on geometric statistics, leading to AUROC gains of +34 points in multi-domain settings.
摘要：幻觉仍然是在高后果应用中部署生成模型的障碍。在外部真实情况不易用于验证模型输出的情况下尤其如此。这种情况激发了法学硕士内部状态几何信号的研究，这些信号可以预测幻觉并且需要有限的外部知识。鉴于有一系列因素可能导致模型输出被称为幻觉（例如，不相关与不连贯），在本文中，我们询问这些几何统计数据实际上捕获了幻觉的哪些具体属性。为了评估这一点，我们生成了一个合成数据集，该数据集改变了与幻觉相关的输出的不同属性。这包括输出的正确性、置信度、相关性、连贯性和完整性。我们发现不同的几何统计数据捕捉到不同类型的幻觉。在此过程中，我们表明许多现有的几何检测方法对任务领域的变化（例如，数学问题与历史问题）具有很大的敏感性。受此启发，我们引入了一种简单的归一化方法来减轻域移位对几何统计的影响，从而在多域设置中获得+34点的AUROC增益。

Title: All-in-One Conditioning for Text-to-Image Synthesis

Authors: Hirunima Jayasekara, Chuong Huynh, Yixuan Ren, Christabel Acquaye, Abhinav Shrivastava
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09165
Pdf URL: https://arxiv.org/pdf/2602.09165
Copy Paste: [[2602.09165]] All-in-One Conditioning for Text-to-Image Synthesis(https://arxiv.org/abs/2602.09165)
Keywords: generation
Abstract: Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Even though prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
摘要：涉及多个对象、属性和空间关系的复杂提示的准确解释和视觉表示是文本到图像合成的关键挑战。尽管最近在生成逼真输出方面取得了进展，但当前模型在处理复杂的文本输入时常常难以维持语义保真度和结构连贯性。我们提出了一种新颖的方法，将文本到图像的合成建立在场景图结构的框架内，旨在增强现有模型的组合能力。尽管先前的方法已经尝试通过使用从提示导出的预定义布局图来解决这个问题，但是这种严格的约束通常限制了构图的灵活性和多样性。相比之下，我们引入了一种零样本、基于场景图的条件机制，可以在推理过程中生成软视觉指导。我们方法的核心是属性-大小-数量-位置 (ASQL) 条件器，它通过轻量级语言模型生成视觉条件，并通过推理时间优化指导基于扩散的生成。这使得模型能够保持文本图像对齐，同时支持轻量级、连贯且多样化的图像合成。

Title: Gradient Residual Connections

Authors: Yangchen Pan, Qizhen Ying, Philip Torr, Bo Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09190
Pdf URL: https://arxiv.org/pdf/2602.09190
Copy Paste: [[2602.09190]] Gradient Residual Connections(https://arxiv.org/abs/2602.09190)
Keywords: super-resolution
Abstract: Existing work has linked properties of a function's gradient to the difficulty of function approximation. Motivated by these insights, we study how gradient information can be leveraged to improve a neural network's ability to approximate high-frequency functions, and we propose a gradient-based residual connection as a complement to the standard identity skip connection used in residual networks. We provide simple theoretical intuition for why gradient information can help distinguish inputs and improve the approximation of functions with rapidly varying behaviour. On a synthetic regression task with a high-frequency sinusoidal ground truth, we show that conventional residual connections struggle to capture high-frequency patterns. In contrast, our gradient residual substantially improves approximation quality. We then introduce a convex combination of the standard and gradient residuals, allowing the network to flexibly control how strongly it relies on gradient information. After validating the design choices of our proposed method through an ablation study, we further validate our approach's utility on the single-image super-resolution task, where the underlying function may be high-frequency. Finally, on standard tasks such as image classification and segmentation, our method achieves performance comparable to standard residual networks, suggesting its broad utility.
摘要：现有的工作将函数梯度的属性与函数逼近的难度联系起来。受这些见解的启发，我们研究了如何利用梯度信息来提高神经网络逼近高频函数的能力，并提出基于梯度的残差连接作为残差网络中使用的标准身份跳跃连接的补充。我们提供了简单的理论直觉，解释为什么梯度信息可以帮助区分输入并改进具有快速变化行为的函数的近似。在具有高频正弦基本事实的综合回归任务中，我们表明传统的残差连接难以捕获高频模式。相比之下，我们的梯度残差大大提高了近似质量。然后，我们引入标准残差和梯度残差的凸组合，允许网络灵活控制其对梯度信息的依赖程度。通过消融研究验证了我们提出的方法的设计选择后，我们进一步验证了我们的方法在单图像超分辨率任务上的实用性，其中底层函数可能是高频的。最后，在图像分类和分割等标准任务上，我们的方法实现了与标准残差网络相当的性能，表明其广泛的实用性。

Title: Rethinking Global Text Conditioning in Diffusion Transformers

Authors: Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09268
Pdf URL: https://arxiv.org/pdf/2602.09268
Copy Paste: [[2602.09268]] Rethinking Global Text Conditioning in Diffusion Transformers(https://arxiv.org/abs/2602.09268)
Keywords: generation
Abstract: Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
摘要：扩散变压器通常通过注意力层和使用池文本嵌入的调制机制合并文本信息。然而，最近的方法放弃了基于调制的文本调节并完全依赖于注意力。在本文中，我们讨论基于调制的文本调节是否必要以及它是否可以提供任何性能优势。我们的分析表明，在常规用法中，池化嵌入对整体性能贡献不大，这表明仅注意通常足以忠实地传播提示信息。然而，我们发现，从不同的角度使用时，池化嵌入可以提供显着的收益——作为指导并实现向更理想的属性的可控转变。这种方法无需培训，易于实现，运行时开销可以忽略不计，并且可以应用于各种扩散模型，从而改进各种任务，包括文本到图像/视频生成和图像编辑。

Title: Measuring Privacy Risks and Tradeoffs in Financial Synthetic Data Generation

Authors: Michael Zuo, Inwon Kang, Stacy Patterson, Oshani Seneviratne
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09288
Pdf URL: https://arxiv.org/pdf/2602.09288
Copy Paste: [[2602.09288]] Measuring Privacy Risks and Tradeoffs in Financial Synthetic Data Generation(https://arxiv.org/abs/2602.09288)
Keywords: generation, generative
Abstract: We explore the privacy-utility tradeoff of synthetic data generation schemes on tabular financial datasets, a domain characterized by high regulatory risk and severe class imbalance. We consider representative tabular data generators, including autoencoders, generative adversarial networks, diffusion, and copula synthesizers. To address the challenges of the financial domain, we provide novel privacy-preserving implementations of GAN and autoencoder synthesizers. We evaluate whether and how well the generators simultaneously achieve data quality, downstream utility, and privacy, with comparison across balanced and imbalanced input datasets. Our results offer insight into the distinct challenges of generating synthetic data from datasets that exhibit severe class imbalance and mixed-type attributes.
摘要：我们探索表格金融数据集上合成数据生成方案的隐私与效用权衡，该领域的特点是高监管风险和严重的类别不平衡。我们考虑代表性的表格数据生成器，包括自动编码器、生成对抗网络、扩散和 copula 合成器。为了应对金融领域的挑战，我们提供了新颖的 GAN 和自动编码器合成器的隐私保护实现。我们通过平衡和不平衡输入数据集的比较来评估生成器是否以及如何同时实现数据质量、下游效用和隐私。我们的结果让我们深入了解从表现出严重类别不平衡和混合类型属性的数据集中生成合成数据的独特挑战。

Title: Stabilizing Physics-Informed Consistency Models via Structure-Preserving Training

Authors: Che-Chia Chang, Chen-Yang Dai, Te-Sheng Lin, Ming-Chih Lai, Chieh-Hsin Lai
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2602.09303
Pdf URL: https://arxiv.org/pdf/2602.09303
Copy Paste: [[2602.09303]] Stabilizing Physics-Informed Consistency Models via Structure-Preserving Training(https://arxiv.org/abs/2602.09303)
Keywords: generation, generative
Abstract: We propose a physics-informed consistency modeling framework for solving partial differential equations (PDEs) via fast, few-step generative inference. We identify a key stability challenge in physics-constrained consistency training, where PDE residuals can drive the model toward trivial or degenerate solutions, degrading the learned data distribution. To address this, we introduce a structure-preserving two-stage training strategy that decouples distribution learning from physics enforcement by freezing the coefficient decoder during physics-informed fine-tuning. We further propose a two-step residual objective that enforces physical consistency on refined, structurally valid generative trajectories rather than noisy single-step predictions. The resulting framework enables stable, high-fidelity inference for both unconditional generation and forward problems. We demonstrate that forward solutions can be obtained via a projection-based zero-shot inpainting procedure, achieving accuracy consistent with diffusion baselines at orders-of-magnitude lower computational cost.
摘要：我们提出了一种基于物理的一致性建模框架，用于通过快速、少步的生成推理来求解偏微分方程（PDE）。我们确定了物理约束一致性训练中的一个关键稳定性挑战，其中偏微分方程残差可以将模型推向平凡或退化的解决方案，从而降低学习到的数据分布。为了解决这个问题，我们引入了一种保留结构的两阶段训练策略，通过在物理信息微调期间冻结系数解码器，将分布学习与物理执行分离。我们进一步提出了一个两步残差目标，它增强了精炼的、结构上有效的生成轨迹的物理一致性，而不是嘈杂的单步预测。由此产生的框架能够对无条件生成和前向问题进行稳定、高保真的推理。我们证明，可以通过基于投影的零样本修复程序获得前向解决方案，实现扩散基线的一致精度，同时计算成本降低几个数量级。

Title: Empowering Contrastive Federated Sequential Recommendation with LLMs

Authors: Thi Minh Chau Nguyen, Minh Hieu Nguyen, Duc Anh Nguyen, Xuan Huong Tran, Thanh Trung Huynh, Quoc Viet Hung Nguyen
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2602.09306
Pdf URL: https://arxiv.org/pdf/2602.09306
Copy Paste: [[2602.09306]] Empowering Contrastive Federated Sequential Recommendation with LLMs(https://arxiv.org/abs/2602.09306)
Keywords: generation
Abstract: Federated sequential recommendation (FedSeqRec) aims to perform next-item prediction while keeping user data decentralised, yet model quality is frequently constrained by fragmented, noisy, and homogeneous interaction logs stored on individual devices. Many existing approaches attempt to compensate through manual data augmentation or additional server-side constraints, but these strategies either introduce limited semantic diversity or increase system overhead. To overcome these challenges, we propose LUMOS, a parameter-isolated FedSeqRec architecture that integrates large language models (LLMs) as local semantic generators. Instead of sharing gradients or auxiliary parameters, LUMOS privately invokes an on-device LLM to construct three complementary sequence variants from each user history: (i) future-oriented trajectories that infer plausible behavioural continuations, (ii) semantically equivalent rephrasings that retain user intent while diversifying interaction patterns, and (iii) preference-inconsistent counterfactuals that serve as informative negatives. These synthesized sequences are jointly encoded within the federated backbone through a tri-view contrastive optimisation scheme, enabling richer representation learning without exposing sensitive information. Experimental results across three public benchmarks show that LUMOS achieves consistent gains over competitive centralised and federated baselines on HR@20 and NDCG@20. In addition, the use of semantically grounded positive signals and counterfactual negatives improves robustness under noisy and adversarial environments, even without dedicated server-side protection modules. Overall, this work demonstrates the potential of LLM-driven semantic generation as a new paradigm for advancing privacy-preserving federated recommendation.
摘要：联合顺序推荐 (FedSeqRec) 旨在执行下一项预测，同时保持用户数据分散，但模型质量经常受到存储在各个设备上的碎片、噪声和同质交互日志的限制。许多现有方法试图通过手动数据增强或额外的服务器端约束来进行补偿，但这些策略要么引入有限的语义多样性，要么增加系统开销。为了克服这些挑战，我们提出了 \textbf{LUMOS}，一种参数隔离的 FedSeqRec 架构，它将大型语言模型（LLM）集成为 \emph{本地语义生成器}。 LUMOS 不共享梯度或辅助参数，而是私下调用设备上的 LLM 来根据每个用户历史构建三个互补的序列变体：(i) \emph{面向未来的} 轨迹，推断出合理的行为延续；(ii) \emph{语义等效的重新措辞}，保留用户意图，同时使交互模式多样化；(iii) \emph{偏好不一致的反事实} 提供信息负面的。这些合成序列通过三视图对比优化方案在联邦骨干网中联合编码，从而在不暴露敏感信息的情况下实现更丰富的表示学习。三个公共基准的实验结果表明，LUMOS 在 HR@20 和 NDCG@20 上比竞争性集中式和联合基准取得了一致的收益。此外，即使没有专用的服务器端保护模块，使用基于语义的积极信号和反事实消极信号也可以提高在嘈杂和对抗环境下的鲁棒性。总的来说，这项工作展示了法学硕士驱动的语义生成作为推进隐私保护联合推荐的新范式的潜力。

Title: Learning with Multiple Correct Answers -- A Trichotomy of Regret Bounds under Different Feedback Models

Authors: Alireza F. Pour, Farnam Mansouri, Shai Ben-David
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09402
Pdf URL: https://arxiv.org/pdf/2602.09402
Copy Paste: [[2602.09402]] Learning with Multiple Correct Answers -- A Trichotomy of Regret Bounds under Different Feedback Models(https://arxiv.org/abs/2602.09402)
Keywords: generation
Abstract: We study an online learning problem with multiple correct answers, where each instance admits a set of valid labels, and in each round the learner must output a valid label for the queried example. This setting is motivated by language generation tasks, in which a prompt may admit many acceptable completions, but not every completion is acceptable. We study this problem under three feedback models. For each model, we characterize the optimal mistake bound in the realizable setting using an appropriate combinatorial dimension. We then establish a trichotomy of regret bounds across the three models in the agnostic setting. Our results also imply sample complexity bounds for the batch setup that depend on the respective combinatorial dimensions.
摘要：我们研究一个具有多个正确答案的在线学习问题，其中每个实例都承认一组有效标签，并且在每一轮中学习者必须为查询的示例输出有效标签。此设置是由语言生成任务驱动的，其中提示可能会接受许多可接受的完成，但并非每个完成都是可接受的。我们在三种反馈模型下研究这个问题。对于每个模型，我们使用适当的组合维度来描述可实现设置中的最佳错误界限。然后，我们在不可知论环境中对三个模型建立了后悔界限的三分法。我们的结果还意味着批量设置的样本复杂性界限取决于各自的组合维度。

Title: K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge

Authors: Zhikai Li, Jiatong Li, Xuewen Liu, Wangbo Zhao, Pan Du, Kaicheng Zhou, Qingyi Gu, Yang You, Zhen Dong, Kurt Keutzer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09411
Pdf URL: https://arxiv.org/pdf/2602.09411
Copy Paste: [[2602.09411]] K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge(https://arxiv.org/abs/2602.09411)
Keywords: generation, generative
Abstract: The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, which inherently limits their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When a new model is evaluated, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provides the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
摘要：视觉生成模型的快速发展提出了对更具可扩展性和人性化评估方法的需求。虽然众包 Arena 平台通过收集人类投票来提供人类偏好评估，但它们成本高昂且耗时，本质上限制了其可扩展性。利用视觉语言模型（VLM）替代人工判断是一种很有前景的解决方案。然而，VLM 固有的幻觉和偏差阻碍了与人类偏好的一致性，从而损害了评估的可靠性。此外，静态评估方法导致效率低下。在本文中，我们提出了 K-Sort Eval，这是一种可靠且高效的基于 VLM 的评估框架，集成了后验校正和动态匹配。具体来说，我们从 K-Sort Arena 中数千个人类投票中策划了一个高质量的数据集，每个实例都包含 K 个模型的输出和排名。在评估新模型时，它会与现有模型进行 (K+1) 方式的自由比较，然后 VLM 提供排名。为了增强对齐和可靠性，我们提出了一种后验校正方法，该方法根据 VLM 预测和人类监督之间的一致性自适应地校正贝叶斯更新中的后验概率。此外，我们提出了一种动态匹配策略，平衡不确定性和多样性，以最大化每次比较的预期收益，从而确保更有效的评估。大量实验表明，K-Sort Eval 提供的评估结果与 K-Sort Arena 一致，通常需要少于 90 次模型运行，证明了其效率和可靠性。

Title: Reward-Guided Discrete Diffusion via Clean-Sample Markov Chain for Molecule and Biological Sequence Design

Authors: Prin Phunyaphibarn, Minhyuk Sung
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2602.09424
Pdf URL: https://arxiv.org/pdf/2602.09424
Copy Paste: [[2602.09424]] Reward-Guided Discrete Diffusion via Clean-Sample Markov Chain for Molecule and Biological Sequence Design(https://arxiv.org/abs/2602.09424)
Keywords: generation, generative
Abstract: Discrete diffusion models have recently emerged as a powerful class of generative models for chemistry and biology data. In these fields, the goal is to generate various samples with high rewards (e.g., drug-likeness in molecules), making reward-based guidance crucial. Most existing methods are based on guiding the diffusion model using intermediate rewards but tend to underperform since intermediate rewards are noisy due to the non-smooth nature of reward functions used in scientific domains. To address this, we propose Clean-Sample Markov Chain (CSMC) Sampler, a method that performs effective test-time reward-guided sampling for discrete diffusion models, enabling local search without relying on intermediate rewards. CSMC constructs a Markov chain of clean samples using the Metropolis-Hastings algorithm such that its stationary distribution is the target distribution. We design a proposal distribution by sequentially applying the forward and backward diffusion processes, making the acceptance probability tractable. Experiments on molecule and biological sequence generation with various reward functions demonstrate that our method consistently outperforms prior approaches that rely on intermediate rewards.
摘要：离散扩散模型最近已成为一类强大的化学和生物学数据生成模型。在这些领域，目标是生成具有高奖励的各种样本（例如分子中的药物相似性），因此基于奖励的指导至关重要。大多数现有方法都是基于使用中间奖励来指导扩散模型，但往往表现不佳，因为由于科学领域中使用的奖励函数的非平滑性质，中间奖励是嘈杂的。为了解决这个问题，我们提出了清洁样本马尔可夫链（CSMC）采样器，这是一种对离散扩散模型执行有效的测试时奖励引导采样的方法，无需依赖中间奖励即可实现本地搜索。 CSMC 使用 Metropolis-Hastings 算法构建干净样本的马尔可夫链，使其平稳分布作为目标分布。我们通过顺序应用前向和后向扩散过程来设计提案分布，使接受概率易于处理。具有各种奖励函数的分子和生物序列生成实验表明，我们的方法始终优于依赖中间奖励的先前方法。
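
Note on the Metropolis-Hastings step above: a minimal sketch of reward-guided sampling over clean discrete sequences, assuming a target distribution proportional to exp(reward / temperature). The proposal here is a simple symmetric single-site resampling, swapped in for the forward-then-backward diffusion proposal the abstract describes; the GC-count reward, alphabet, and all names are hypothetical toys.

```python
import math
import random


def metropolis_hastings(reward, init, steps, temperature=1.0,
                        alphabet="ACGT", seed=0):
    """Run a Metropolis-Hastings chain over sequences and track the best one.

    The target is p(x) proportional to exp(reward(x) / temperature).
    The proposal picks one position uniformly and resamples its symbol
    uniformly from the alphabet, which is symmetric, so the acceptance
    ratio reduces to exp((r(y) - r(x)) / temperature).
    """
    rng = random.Random(seed)
    x = list(init)
    rx = reward(x)
    best, best_r = list(x), rx
    for _ in range(steps):
        y = list(x)
        i = rng.randrange(len(y))
        y[i] = rng.choice(alphabet)
        ry = reward(y)
        # Always accept improvements; accept worse moves with MH probability.
        if ry >= rx or rng.random() < math.exp((ry - rx) / temperature):
            x, rx = y, ry
            if rx > best_r:
                best, best_r = list(x), rx
    return "".join(x), "".join(best), best_r


def gc_count(seq):
    """Toy reward (assumption): number of G or C symbols in the sequence."""
    return sum(1 for s in seq if s in "GC")


init = "AAAAAAAAAA"
final, best, best_r = metropolis_hastings(gc_count, init, steps=500,
                                          temperature=0.5)
print(final, best, best_r)
```

Every state in this chain is a clean sample and the reward is only ever evaluated on clean samples, which is the property the abstract emphasizes; the quality of the chain in practice depends on the proposal, which this sketch does not reproduce.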

Title: Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification

Authors: Yiqiao Li, Bo Shang, Jie Wei
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09425
Pdf URL: https://arxiv.org/pdf/2602.09425
Copy Paste: [[2602.09425]] Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification(https://arxiv.org/abs/2602.09425)
Keywords: generation
Abstract: Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$), but degrades accuracy in more-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) without any costly training or fine-tuning, significantly reducing the demands of initial manual labeling and making the method practical for ITS applications.
摘要：细粒度的卡车分类对于智能交通系统 (ITS) 至关重要，但当前基于 LiDAR 的方法由于依赖监督深度学习和劳动密集型手动注释而面临可扩展性挑战。视觉语言模型 (VLM) 提供了有希望的少样本泛化，但它们在路边 LiDAR 中的应用受到稀疏 3D 点云和密集 2D 图像之间模态差距的限制。我们提出了一个框架，通过采用现成的 VLM 来进行细粒度的卡车分类，而无需进行参数微调，从而弥补了这一差距。我们新的深度感知图像生成管道应用噪声去除、空间和时间配准、方向校正、形态操作和各向异性平滑来将稀疏、遮挡的 LiDAR 扫描转换为深度编码的 2D 视觉代理。我们的方法在包含 20 个车辆类别的真实数据集上进行了验证，每类只需 16-30 个示例即可实现有竞争力的分类准确性，为数据密集型监督基线提供了可扩展的替代方案。我们进一步观察到“语义锚”效应：基于文本的指导规范了超低镜头状态 $k < 4$ 中的性能，但由于语义不匹配而降低了更多镜头设置中的准确性。此外，我们还证明了该框架作为冷启动策略的有效性，使用 VLM 生成的标签来引导轻量级监督模型。值得注意的是，基于few-shot VLM的模型对特定拖运类别（20英尺、40英尺和53英尺集装箱）的正确分类率达到了75%以上，完全不需要昂贵的培训或微调，显着减少了初始手动标签的密集需求，从而实现了在ITS应用中实际使用的方法。

Title: SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Authors: Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09432
Pdf URL: https://arxiv.org/pdf/2602.09432
Copy Paste: [[2602.09432]] SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL(https://arxiv.org/abs/2602.09432)
Keywords: generation
Abstract: Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
摘要：由于缺乏深思熟虑的推理，当前的一次性 3D 场景合成方法经常会出现空间幻觉，例如碰撞。为了弥补这一差距，我们引入了 SceneReVis，这是一个基于视觉的自我反思框架，它采用迭代的“诊断和行动”循环，使用多模态反馈来明确拦截和解决空间冲突。为了支持这种逐步范例，我们构建了 SceneChain-12k，这是一个通过新颖的逆向工程管道导出的因果构建轨迹的大型数据集。我们进一步提出了一个两阶段的训练方法，从监督微调过渡到代理强化学习，将模型演变成主动空间规划器。大量实验表明，SceneReVis 在高保真生成和面向目标的优化方面实现了最先进的性能，并对长尾域具有强大的泛化能力。

Title: Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

Authors: Xu Ma, Yitian Zhang, Qihua Dong, Yun Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09439
Pdf URL: https://arxiv.org/pdf/2602.09439
Copy Paste: [[2602.09439]] Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning(https://arxiv.org/abs/2602.09439)
Keywords: generation
Abstract: High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.
摘要：高质量和开放的数据集仍然是文本到图像（T2I）微调的主要瓶颈。尽管模型架构和训练流程取得了快速进展，但大多数公开可用的微调数据集都存在分辨率低、文本图像对齐差或多样性有限的问题，导致开放研究模型和企业级模型之间存在明显的性能差距。在这项工作中，我们提出了 Fine-T2I，一个用于 T2I 微调的大规模、高质量、完全开放的数据集。 Fine-T2I跨越10种任务组合、32种提示类别、11种视觉风格和5种提示模板，并将强大的现代模型生成的合成图像与专业摄影师精心策划的真实图像相结合。所有样本都经过严格的文本图像对齐、视觉保真度和提示质量过滤，超过 95% 的初始候选样本被删除。最终数据集包含超过 600 万个文本图像对，磁盘大小约为 2 TB，接近预训练数据集的规模，同时保持微调级别的质量。在一系列不同的预训练扩散和自回归模型中，Fine-T2I 上的微调不断提高生成质量和指令依从性，这一点经过人工评估、视觉比较和自动指标的验证。我们在开放许可下发布 Fine-T2I，以帮助缩小开放社区中 T2I 微调的数据差距。

Title: Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing

Authors: Yan Luo, Henry Huang, Todd Y. Zhou, Mengyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09449
Pdf URL: https://arxiv.org/pdf/2602.09449
Copy Paste: [[2602.09449]] Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing(https://arxiv.org/abs/2602.09449)
Keywords: generation, generative
Abstract: Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches based on future and past velocity $v$ and latent trajectory $z$ information that refine the generative path directly in latent space. The two trajectory smoothing schemes are Look-Ahead, which averages the current and next-step latents using a curvature-gated weight, and Look-Back, which smooths latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
摘要：最近的进展通过流匹配框架将扩散模型重新表述为确定性常微分方程（ODE），为噪声到数据的生成过程提供了统一的公式。人们已经开发了各种免训练的流量匹配方法，以通过流速场调整来改进图像生成，从而消除了昂贵的重新训练的需要。然而，修改速度场 $v$ 会引入通过整个生成路径传播的误差，而对潜在轨迹 $z$ 的调整自然会由预训练的速度网络进行纠正，从而减少误差累积。在本文中，我们提出了两种基于未来和过去速度 $v$ 和潜在轨迹 $z$ 信息的互补的免训练潜在轨迹调整方法，直接在潜在空间中细化生成路径。我们提出了两种免训练的轨迹平滑方案：\emph{Look-Ahead}，它使用曲率门控权重对当前和下一步潜在变量进行平均；以及 \emph{Look-Back}，它使用带有衰减的指数移动平均值来平滑潜在变量。我们通过广泛的实验和综合评估指标证明，所提出的免训练轨迹平滑模型在多个数据集（包括 COCO17、CUB-200 和 Flickr30K）上明显优于各种最先进的模型。
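
The two smoothing schemes named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `decay` and `scale` parameters, and in particular the form of `curvature_gate` are assumptions made for illustration, since the abstract does not specify how the curvature-gated weight is computed.

```python
import math


def look_ahead(z_cur, z_next, gate):
    """Blend the current and next-step latents.

    gate in [0, 1] is the weight given to the next-step latent.
    """
    return [(1.0 - gate) * a + gate * b for a, b in zip(z_cur, z_next)]


def look_back(z_ema, z_new, decay):
    """Exponential moving average of latents; decay weights the history."""
    return [decay * e + (1.0 - decay) * z for e, z in zip(z_ema, z_new)]


def curvature_gate(v_cur, v_next, scale=1.0):
    """Hypothetical gate (an assumption, not taken from the paper).

    Uses the change in velocity between two steps as a curvature proxy:
    no change gives a gate of 1.0, larger changes shrink the gate toward 0.
    """
    change = math.sqrt(sum((b - a) ** 2 for a, b in zip(v_cur, v_next)))
    return 1.0 / (1.0 + scale * change)


# Toy usage: smooth a jagged 1-D latent trajectory with Look-Back.
trajectory = [[0.0], [1.0], [0.5], [1.5], [1.0]]
ema = trajectory[0]
smoothed = [ema]
for z in trajectory[1:]:
    ema = look_back(ema, z, decay=0.8)
    smoothed.append(ema)
```

Both operations act on the latent `z` rather than on the velocity field, which matches the abstract's argument that latent adjustments can be corrected by the pretrained velocity network at later steps.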

Title: FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation

Authors: Chuanhai Zang, Jiabao Hu, XW Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09476
Pdf URL: https://arxiv.org/pdf/2602.09476
Copy Paste: [[2602.09476]] FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation(https://arxiv.org/abs/2602.09476)
Keywords: generation
Abstract: Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
摘要：合成数据为几何敏感的视觉任务提供了低成本、准确注释的样本，但合成域和真实域之间的外观和成像差异会导致严重的域偏移并降低下游性能。不成对的合成到真实的翻译可以在没有成对监督的情况下缩小这种差距，但现有的方法通常面临照片真实性和结构稳定性之间的权衡：无约束的生成可能会引入变形或虚假纹理，而过于严格的约束限制了对实域统计数据的适应。我们提出了 FD-DB，一种频率解耦双分支模型，将外观传输分离为低频可解释编辑和高频残差补偿。可解释分支预测物理上有意义的编辑参数（白平衡、曝光、对比度、饱和度、模糊和颗粒），以构建具有强大内容保留的稳定低频外观基础。自由分支通过残差生成补充精细细节，门控融合机制在明确的频率约束下组合两个分支以限制低频漂移。我们进一步采用两阶段训练计划，首先稳定编辑分支，然后释放残余分支以提高优化稳定性。在 YCB-V 数据集上的实验表明，FD-DB 提高了实域外观一致性，并显着提高了下游语义分割性能，同时保留了几何和语义结构。

Title: Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

Authors: Lin Chen, Xiaoke Zhao, Kun Ding, Weiwei Feng, Changtao Miao, Zili Wang, Wenxuan Guo, Ying Wang, Kaiyuan Zheng, Bo Zhang, Zhe Li, Shiming Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09483
Pdf URL: https://arxiv.org/pdf/2602.09483
Copy Paste: [[2602.09483]] Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions(https://arxiv.org/abs/2602.09483)
Keywords: generation, generative
Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components. IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at this https URL.
摘要：多模态大型语言模型 (MLLM) 展示了令人印象深刻的跨模态功能，但其庞大的规模带来了重大的部署挑战。知识蒸馏（KD）是压缩这些模型的一种有前景的解决方案，但现有方法主要依赖于静态下一个令牌对齐，忽略了动态令牌交互，而动态令牌交互嵌入了多模态理解和生成的基本功能。为此，我们引入Align-TI，一个从Token交互角度设计的新颖的KD框架。我们的方法的动机是认识到 MLLM 依赖于两种主要交互：用于提取相关视觉信息的视觉指令令牌交互，以及用于连贯生成的响应内令牌交互。因此，Align-TI 引入了两个组件：IVA 使学生模型能够通过对齐显着视觉区域来模仿教师的与教学相关的视觉信息提取能力。 TPA 通过调整顺序标记到标记的转换概率来捕获教师的动态生成逻辑。大量的实验证明了Align-TI 的优越性。值得注意的是，我们的方法比 Vanilla KD 实现了 2.6\%$ 的相对改进，并且我们的蒸馏 Align-TI-2B 甚至比 LLaVA-1.5-7B（更大的 MLLM）高出 7.0\%$，为训练参数高效的 MLLM 建立了一个新的最先进的蒸馏框架。代码可从此 https URL 获取。

Title: Towards Uniformity and Alignment for Multimodal Representation Learning

Authors: Wenzhe Yin, Pan Zhou, Zehao Xiao, Jie Liu, Shujian Yu, Jan-Jakob Sonke, Efstratios Gavves
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09507
Pdf URL: https://arxiv.org/pdf/2602.09507
Copy Paste: [[2602.09507]] Towards Uniformity and Alignment for Multimodal Representation Learning(https://arxiv.org/abs/2602.09507)
Keywords: generation, generative
Abstract: Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
摘要：多模态表示学习旨在构建一个共享的嵌入空间，其中异构模态在语义上对齐。尽管有强有力的实证结果，但基于 InfoNCE 的目标引入了固有的冲突，从而产生了跨模式的分配差距。在这项工作中，我们确定了多模态体系中的两个冲突，这两种冲突都随着模态数量的增加而加剧：（i）对齐均匀性冲突，均匀性的排斥破坏了成对对齐，以及（ii）内部对齐冲突，其中对齐多种模态导致竞争对齐方向。为了解决这些问题，我们提出了多模态表示的对齐和均匀性的原则性解耦，为多模态学习提供无冲突的方法，同时支持判别性和生成性用例，而无需特定于任务的模块。然后，我们提供了一个理论保证，即我们的方法可以有效地代表多个模态分布上的全局 Hölder 散度，从而减少模态之间的分布差距。关于检索和 UnCLIP 式生成的广泛实验证明了一致的收益。

Title: Robust Depth Super-Resolution via Adaptive Diffusion Sampling

Authors: Kun Wang, Yun Zhu, Pan Zhou, Na Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09510
Pdf URL: https://arxiv.org/pdf/2602.09510
Copy Paste: [[2602.09510]] Robust Depth Super-Resolution via Adaptive Diffusion Sampling(https://arxiv.org/abs/2602.09510)
Keywords: super-resolution, generative
Abstract: We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to an isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling the generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
摘要：我们提出了 AdaDS，这是一种用于深度超分辨率的通用框架，可以从任意降级的低分辨率输入中稳健地恢复高分辨率深度图。与直接回归深度值并经常在严重或未知退化下表现出伪影的传统方法不同，AdaDS 利用高斯平滑的收缩特性：随着噪声在前向过程中累积，退化输入与其原始高质量对应输入之间的分布差异会减小，最终收敛到各向同性高斯先验。利用这一点，AdaDS 根据估计的细化不确定性自适应地选择反向扩散轨迹中的起始时间步，然后注入定制噪声以将中间样本定位在目标后验分布的高概率区域内。该策略确保了固有的鲁棒性，即使上游估计不完善，预训练扩散模型的生成先验也能主导恢复。对现实世界和综合基准的大量实验证明，与最先进的方法相比，AdaDS 具有卓越的零样本泛化能力和对不同退化模式的适应能力。
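
The contraction argument in the abstract above can be made concrete with a short Python sketch. It assumes a standard variance-preserving (DDPM-style) forward process with a linear beta schedule, which the abstract does not confirm. Under that assumption, the mean of a noised sample is scaled by sqrt(alpha_bar_t), so the gap between a degraded input and its clean counterpart shrinks by the same factor. The function `pick_start_step` and its linear mapping from uncertainty to timestep are purely hypothetical placeholders for the adaptive selection described in the abstract.

```python
import math


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t on a linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def mean_gap(gap_at_zero, t):
    """Gap between the means of two noised samples at step t.

    Both samples follow x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    so the difference of their means is scaled by sqrt(alpha_bar_t).
    """
    return math.sqrt(alpha_bar(t)) * gap_at_zero


def pick_start_step(uncertainty, num_steps=1000):
    """Hypothetical rule (an assumption): higher uncertainty, noisier start."""
    u = min(max(uncertainty, 0.0), 1.0)
    return max(1, round(u * num_steps))


# The gap shrinks monotonically as the starting step moves toward pure noise.
for step in (10, 100, 500, 1000):
    print(step, mean_gap(1.0, step))
```

The trade-off this illustrates is that a later starting step hides more of the degradation but also discards more of the input's structure, which is why the starting step is chosen per input rather than fixed.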

Title: SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem

Authors: Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09528
Pdf URL: https://arxiv.org/pdf/2602.09528
Copy Paste: [[2602.09528]] SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem(https://arxiv.org/abs/2602.09528)
Keywords: generation
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.
摘要：多模态大语言模型 (MLLM) 的最新进展在各个领域取得了巨大的成功。然而，由于持续的幻觉，它们在医疗保健等高风险领域的使用仍然受到限制，其中生成的文本与视觉输入相矛盾或忽略。我们认为 MLLM 可以理解图像，但难以生成准确的标记序列。微小的扰动可能会将注意力从真实状态转移到不真实状态，而文本生成的自回归性质通常会阻止错误纠正。为了解决这个问题，我们提出了 SchröMind——一种通过解决薛定谔桥问题来减少幻觉的新颖框架。它通过轻量级训练以最小的传输成本在幻觉和真实激活之间建立了令牌级映射，同时保留了模型的原始功能。 POPE 和 MME 基准的大量实验证明了薛定谔的优越性，它实现了最先进的性能，同时只引入了最小的计算开销。

Title: DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment

Authors: Bohan Fu, Guanyi Qin, Fazhan Zhang, Zihao Huang, Mingxuan Li, Runze Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09531
Pdf URL: https://arxiv.org/pdf/2602.09531
Copy Paste: [[2602.09531]] DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment(https://arxiv.org/abs/2602.09531)
Keywords: quality assessment
Abstract: Blind Image Quality Assessment (BIQA), which aims to replicate human perception of visual quality without a reference, plays a key role in vision tasks, yet existing models often fail to capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors, as methods typically learn shallow relationships between unified image features and quality scores, making them insensitive to distortions and thus limiting their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature according to its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.
摘要：盲图像质量评估旨在在没有参考的情况下复制人类对视觉质量的感知，在视觉任务中发挥着关键作用，但现有模型往往无法有效捕获微妙的失真线索，导致与人类主观判断的不一致。我们发现这种限制的根本原因在于缺乏可靠的失真先验，因为方法通常学习统一图像特征和质量分数之间的浅层关系，导致它们对失真不敏感，从而限制了它们的性能。为了解决这个问题，我们引入了这个 http URL，这是一种新颖的先验驱动的 BIQA 框架，旨在明确合并失真先验，从而实现可靠的质量评估。该http URL 首先利用退化感知视觉语言模型来获取特定于失真的先验，并通过提出的失真显着性差分模块将其与语义注意区分开来进一步细化和增强，从而确保失真的真实表示。然后，经过改进的先验以及语义和桥接表示，由提出的名为动态失真加权模块的专家混合风格模块进行融合。该机制根据每个失真特定特征的感知影响对其进行加权，确保最终的质量预测与人类感知一致。在五个具有挑战性的 BIQA 基准上进行的大量实验证明了该 http URL 相对于当前方法的优越性，并展示了其在泛化和数据效率方面的卓越性能。

Title: AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Authors: Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu, Zhenglin Zhou, Xiaobo Xia, Jian Xue, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09534
Pdf URL: https://arxiv.org/pdf/2602.09534
Copy Paste: [[2602.09534]] AUHead: Realistic Emotional Talking Head Generation via Action Units Control(https://arxiv.org/abs/2602.09534)
Keywords: generation
Abstract: Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs) through spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into a structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at this https URL.
摘要：逼真的头部说话视频生成对于虚拟化身、电影制作和交互系统至关重要。由于缺乏细粒度的情绪控制，当前的方法难以处理微妙的情绪表达。为了解决这个问题，我们引入了一种新颖的两阶段方法（AUHead）来从音频中分离出细粒度的情感控制，即动作单元（AU），并实现可控生成。在第一阶段，我们通过时空AU标记化和“情感然后AU”的思想链机制来探索大型音频语言模型（ALM）的AU生成能力。它的目的是将 AU 与原始语音分开，有效捕捉微妙的情感线索。在第二阶段，我们提出了一种由 AU 驱动的可控扩散模型，该模型可以合成以 AU 序列为条件的逼真的头部说话视频。具体来说，我们首先将 AU 序列映射到结构化 2D 面部表示中以增强空间保真度，然后在交叉注意模块内对 AU 视觉交互进行建模。为了实现灵活的 AU 质量权衡控制，我们在推理过程中引入了 AU 解纠缠指导策略，进一步细化生成视频的情感表达和身份一致性。基准数据集的结果表明，我们的方法在情感真实性、准确的口型同步和视觉连贯性方面实现了竞争性能，显着超越了现有技术。我们的实现可通过此 https URL 获取

Title: ECG-IMN: Interpretable Mesomorphic Neural Networks for 12-Lead Electrocardiogram Interpretation

Authors: Vajira Thambawita, Jonas L. Isaksen, Jørgen K. Kanters, Hugo L. Hammer, Pål Halvorsen
Subjects: cs.LG, cs.AI, cs.CV, stat.ME
Abstract URL: https://arxiv.org/abs/2602.09566
Pdf URL: https://arxiv.org/pdf/2602.09566
Copy Paste: [[2602.09566]] ECG-IMN: Interpretable Mesomorphic Neural Networks for 12-Lead Electrocardiogram Interpretation(https://arxiv.org/abs/2602.09566)
Keywords: generation
Abstract: Deep learning has achieved expert-level performance in automated electrocardiogram (ECG) diagnosis, yet the "black-box" nature of these models hinders their clinical deployment. Trust in medical AI requires not just high accuracy but also transparency regarding the specific physiological features driving predictions. Existing explainability methods for ECGs typically rely on post-hoc approximations (e.g., Grad-CAM and SHAP), which can be unstable, computationally expensive, and unfaithful to the model's actual decision-making process. In this work, we propose the ECG-IMN, an Interpretable Mesomorphic Neural Network tailored for high-resolution 12-lead ECG classification. Unlike standard classifiers, the ECG-IMN functions as a hypernetwork: a deep convolutional backbone generates the parameters of a strictly linear model specific to each input sample. This architecture enforces intrinsic interpretability, as the decision logic is mathematically transparent and the generated weights (W) serve as exact, high-resolution feature attribution maps. We introduce a transition decoder that effectively maps latent features to sample-wise weights, enabling precise localization of pathological evidence (e.g., ST-elevation, T-wave inversion) in both time and lead dimensions. We evaluate our approach on the PTB-XL dataset for classification tasks, demonstrating that the ECG-IMN achieves competitive predictive performance (AUROC comparable to black-box baselines) while providing faithful, instance-specific explanations. By explicitly decoupling parameter generation from prediction execution, our framework bridges the gap between deep learning capability and clinical trustworthiness, offering a principled path toward "white-box" cardiac diagnostics.
摘要：深度学习在自动心电图（ECG）诊断方面已经取得了专家级的性能，但这些模型的“黑匣子”性质阻碍了它们的临床部署。对医疗人工智能的信任不仅需要高精度，还需要驱动预测的特定生理特征的透明度。现有的心电图可解释性方法通常依赖于事后近似（例如 Grad-CAM 和 SHAP），这种近似可能不稳定、计算成本高且不忠实于模型的实际决策过程。在这项工作中，我们提出了 ECG-IMN，一种专为高分辨率 12 导联心电图分类而定制的可解释介观神经网络。与标准分类器不同，ECG-IMN 充当超网络：深度卷积主干生成特定于每个输入样本的严格线性模型的参数。该架构强制执行内在的可解释性，因为决策逻辑在数学上是透明的，并且生成的权重 (W) 充当精确的高分辨率特征归因图。我们引入了一种转换解码器，可以有效地将潜在特征映射到样本权重，从而能够在时间和导联维度上精确定位病理证据（例如 ST 抬高、T 波反演）。我们评估了我们在 PTB-XL 数据集上用于分类任务的方法，证明 ECG-IMN 实现了有竞争力的预测性能（AUROC 与黑盒基线相当），同时提供了忠实的、特定于实例的解释。通过明确地将参数生成与预测执行解耦，我们的框架弥合了深度学习能力和临床可信度之间的差距，为“白盒”心脏诊断提供了一条原则性路径。
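
The hypernetwork idea in the abstract above, where a backbone emits the weights of a linear model for each input and those weights double as the attribution map, can be sketched in a few lines of Python. The `toy_backbone` below is a hypothetical stand-in for the deep convolutional backbone and transition decoder, and the helper names are assumptions. Only the structure (input-specific weights W, then a strictly linear prediction) follows the abstract.

```python
def toy_backbone(x):
    """Hypothetical weight generator (a stand-in, not the paper's network).

    Returns one weight per (lead, time) sample plus a scalar bias.
    """
    weights = [[v / (1.0 + abs(v)) for v in lead] for lead in x]
    bias = 0.0
    return weights, bias


def predict(x, backbone=toy_backbone):
    """Generate sample-specific weights, then apply a strictly linear model."""
    weights, bias = backbone(x)
    logit = bias + sum(
        w * v
        for w_lead, x_lead in zip(weights, x)
        for w, v in zip(w_lead, x_lead)
    )
    return logit, weights


def top_evidence(x, weights):
    """Return the (lead, time) index with the largest contribution |w * x|."""
    best = max(
        (abs(w * v), i, j)
        for i, (w_lead, x_lead) in enumerate(zip(weights, x))
        for j, (w, v) in enumerate(zip(w_lead, x_lead))
    )
    return best[1], best[2]


# Two leads, three time samples; the second lead has a spike at time index 2.
ecg = [[0.0, 0.1, 0.0], [0.0, 0.0, 2.0]]
logit, w = predict(ecg)
lead, time_index = top_evidence(ecg, w)
```

Because the final step is linear, the logit decomposes exactly into per-sample contributions `w * x`, so the explanation is read off the model itself rather than approximated after the fact.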

Title: Mitigating the Likelihood Paradox in Flow-based OOD Detection via Entropy Manipulation

Authors: Donghwan Kim, Hyunsoo Yoon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09581
Pdf URL: https://arxiv.org/pdf/2602.09581
Copy Paste: [[2602.09581]] Mitigating the Likelihood Paradox in Flow-based OOD Detection via Entropy Manipulation(https://arxiv.org/abs/2602.09581)
Keywords: generative
Abstract: Deep generative models that can tractably compute input likelihoods, including normalizing flows, often assign unexpectedly high likelihoods to out-of-distribution (OOD) inputs. We mitigate this likelihood paradox by manipulating input entropy based on semantic similarity, applying stronger perturbations to inputs that are less similar to an in-distribution memory bank. We provide a theoretical analysis showing that entropy control increases the expected log-likelihood gap between in-distribution and OOD samples in favor of the in-distribution, and we explain why the procedure works without any additional training of the density model. We then evaluate our method against likelihood-based OOD detectors on standard benchmarks and find consistent AUROC improvements over baselines, supporting our explanation.
摘要：深度生成模型可以轻松地计算输入可能性（包括标准化流），通常会将意外的高可能性分配给分布外（OOD）输入。我们通过基于语义相似性操纵输入熵，对与分布内存库不太相似的输入应用更强的扰动，来缓解这种似然悖论。我们提供的理论分析表明，熵控制增加了分布内样本和 OOD 样本之间的预期对数似然差距，有利于分布内分布，并且我们解释了为什么该过程无需对密度模型进行任何额外训练即可工作。然后，我们在标准基准上针对基于可能性的 OOD 检测器评估我们的方法，并发现相对于基线一致的 AUROC 改进，支持我们的解释。

Title: MieDB-100k: A Comprehensive Dataset for Medical Image Editing

Authors: Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo, Fan Wang, Bohan Zhuang, Shenda Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09587
Pdf URL: https://arxiv.org/pdf/2602.09587
Copy Paste: [[2602.09587]] MieDB-100k: A Comprehensive Dataset for Medical Image Editing(https://arxiv.org/abs/2602.09587)
Keywords: generation, generative
Abstract: The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into the perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
摘要：高质量数据的稀缺仍然是采用多模态生成模型进行医学图像编辑的主要瓶颈。现有的医学图像编辑数据集通常面临多样性有限、忽视医学图像理解以及无法平衡质量与可扩展性的问题。为了解决这些差距，我们提出了 MieDB-100k，这是一个用于文本引导医学图像编辑的大规模、高质量和多样化的数据集。它将编辑任务分为感知、修改和转换的视角，同时考虑理解和生成能力。我们通过数据管理管道构建 MieDB-100k，利用特定模态的专家模型和基于规则的数据合成方法，然后进行严格的手动检查以确保临床保真度。大量实验表明，使用 MieDB-100k 训练的模型始终优于开源和专有模型，同时表现出强大的泛化能力。我们预计该数据集将成为专业医学图像编辑未来进步的基石。

Title: Why the Counterintuitive Phenomenon of Likelihood Rarely Appears in Tabular Anomaly Detection with Deep Generative Models?

Authors: Donghwan Kim, Junghun Phee, Hyunsoo Yoon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09593
Pdf URL: https://arxiv.org/pdf/2602.09593
Copy Paste: [[2602.09593]] Why the Counterintuitive Phenomenon of Likelihood Rarely Appears in Tabular Anomaly Detection with Deep Generative Models?(https://arxiv.org/abs/2602.09593)
Keywords: generative
Abstract: Deep generative models with tractable and analytically computable likelihoods, exemplified by normalizing flows, offer an effective basis for anomaly detection through likelihood-based scoring. We demonstrate that, unlike in the image domain where deep generative models frequently assign higher likelihoods to anomalous data, such counterintuitive behavior occurs far less often in tabular settings. We first introduce a domain-agnostic formulation that enables consistent detection and evaluation of the counterintuitive phenomenon, addressing the absence of a precise definition. Through extensive experiments on 47 tabular datasets and 10 CV/NLP embedding datasets in ADBench, benchmarked against 13 baseline models, we demonstrate that the phenomenon, as defined, is consistently rare in general tabular data. We further investigate this phenomenon from both theoretical and empirical perspectives, focusing on the roles of data dimensionality and differences in feature correlation. Our results suggest that likelihood-only detection with normalizing flows offers a practical and reliable approach for anomaly detection in tabular domains.
摘要：具有易处理且可分析计算的可能性的深度生成模型（以标准化流为例）为通过基于可能性的评分进行异常检测提供了有效的基础。我们证明，与图像领域中深度生成模型经常为异常数据分配更高的可能性不同，这种违反直觉的行为在表格设置中发生的频率要少得多。我们首先引入一种与领域无关的公式，可以对反直觉现象进行一致的检测和评估，解决缺乏精确定义的问题。通过对 ADBench 中的 47 个表格数据集和 10 个 CV/NLP 嵌入数据集进行大量实验，并以 13 个基线模型为基准，我们证明了这种现象（如定义）在一般表格数据中始终很少见。我们从理论和实证角度进一步研究了这一现象，重点关注数据维度和特征相关性差异的作用。我们的结果表明，使用归一化流的仅似然检测为表格域中的异常检测提供了一种实用且可靠的方法。

Title: Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures

Authors: Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09600
Pdf URL: https://arxiv.org/pdf/2602.09600
Copy Paste: [[2602.09600]] Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures(https://arxiv.org/abs/2602.09600)
Keywords: generation
Abstract: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
摘要：以自我为中心的交互式世界模型对于增强现实和嵌入式人工智能至关重要，其中视觉生成必须以低延迟、几何一致性和长期稳定性响应用户输入。我们研究自由空间手势下单个场景图像的以自我为中心的交互生成，旨在合成逼真的视频，其中手进入场景，与物体交互，并在头部运动下诱导可信的世界动态。这种设置带来了根本性的挑战，包括自由空间手势和大量接触训练数据之间的分布变化、单目视图中手部运动和相机运动之间的模糊性，以及任意长度视频生成的需要。我们提出了 Hand2World，这是一个统一的自回归框架，它通过基于投影 3D 手部网格的遮挡不变手调节来解决这些挑战，允许从场景上下文中推断可见性和遮挡，而不是在控制信号中进行编码。为了稳定以自我为中心的视点变化，我们通过每像素 Plücker 射线嵌入注入显式相机几何形状，将相机运动与手部运动分开并防止背景漂移。我们进一步开发了一个全自动的单目注释管道，并将双向扩散模型提炼成因果生成器，从而实现任意长度的合成。对三个以自我为中心的交互基准进行的实验表明，感知质量和 3D 一致性得到了显着改善，同时支持相机控制和长视距交互生成。
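
The per-pixel Plücker-ray embedding mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera and the common (direction, moment) convention with moment = origin × direction. The abstract does not state the paper's ordering or normalization, and the function name `plucker_ray` and its parameters are hypothetical.

```python
import math

def plucker_ray(u, v, fx, fy, cx, cy, R, t):
    """Return 6-D Plücker coordinates [d, m] for pixel (u, v).

    Assumptions (not from the paper): pinhole intrinsics fx, fy, cx, cy;
    R is a 3x3 camera-to-world rotation given as a list of rows;
    t is the camera center in world coordinates.
    """
    # Back-project the pixel to a ray direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame.
    d_world = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d_world))
    d = [x / norm for x in d_world]
    # Moment vector m = o x d, where o is the camera center.
    o = t
    m = [
        o[1] * d[2] - o[2] * d[1],
        o[2] * d[0] - o[0] * d[2],
        o[0] * d[1] - o[1] * d[0],
    ]
    return d + m

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Ray through the principal point of a camera centered at (1, 0, 0).
ray = plucker_ray(320, 240, 500.0, 500.0, 320.0, 240.0, identity, [1.0, 0.0, 0.0])
```

Because the moment depends on the camera center while the direction does not, a pure camera translation changes only the last three channels, which is one plausible reason such an embedding helps separate camera motion from scene content.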

Title: Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Authors: Jialun Liu, Yukuo Ma, Xiao Cao, Tian Li, Gonghu Shang, Haibin Huang, Chi Zhang, Xuelong Li, Cong Liu, Junqi Liu, Jiakui Hu, Robby T. Tan, Shiwen Zhang, Liying Yang, Xiaoyan Yang, Qizhen Weng, Xiangzhen Chang, Yuanzhi Liang, Yifan Xu, Zhiyong Huang, Zuoxin Li, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09609
Pdf URL: https://arxiv.org/pdf/2602.09609
Copy Paste: [[2602.09609]] Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing(https://arxiv.org/abs/2602.09609)
Keywords: generation
Abstract: Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
摘要：基于扩散的视频生成的最新进展极大地提高了视觉保真度和时间连贯性。然而，大多数现有方法仍然是特定于任务的，并且主要依赖于文本指令，限制了它们在统一框架内处理多模式输入、上下文参考以及不同视频生成和编辑场景的能力。此外，许多视频编辑方法依赖于针对单独操作精心设计的管道，这阻碍了可扩展性和可组合性。在本文中，我们提出了 Tele-Omni，这是一种用于视频生成和编辑的统一多模式框架，它遵循单一模型中的多模式指令，包括文本、图像和参考视频。 Tele-Omni 利用预训练的多模态大语言模型来解析异构指令并推断结构化生成或编辑意图，而基于扩散的生成器则根据这些结构化信号执行高质量视频合成。为了实现跨异构视频任务的联合训练，我们引入了任务感知数据处理管道，它将多模态输入统一为结构化指令格式，同时保留特定于任务的约束。 Tele-Omni 支持各种以视频为中心的任务，包括文本到视频生成、图像到视频生成、首尾帧视频生成、上下文视频生成和上下文视频编辑。通过将指令解析与视频合成解耦并将其与任务感知数据设计相结合，Tele-Omni 实现了灵活的多模式控制，同时保持了强大的时间连贯性和视觉一致性。实验结果表明，Tele-Omni 在多项任务中实现了具有竞争力的性能。

Title: AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models

Authors: Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2602.09611
Pdf URL: https://arxiv.org/pdf/2602.09611
Copy Paste: [[2602.09611]] AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models(https://arxiv.org/abs/2602.09611)
Keywords: generation
Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
摘要：水印已成为大型视觉语言模型 (LVLM) 中内容可追溯性和知识产权保护的关键解决方案。然而，与视觉无关的水印可能会引入视觉上不相关的标记，并通过强制执行不加区别的伪随机偏差来破坏视觉基础。此外，当前的视觉特定水印依赖于视觉临界权重的静态一次性估计，并且在确定受保护令牌的比例时忽略权重分布密度。这种设计未能考虑生成过程中视觉依赖性的动态变化，并且可能会在长尾中引入低质量的令牌。为了应对这些挑战，我们提出了注意力引导动态水印（AGMark），这是一种新颖的框架，可以嵌入可检测信号，同时严格保持视觉保真度。在每个解码步骤中，AGMark 首先根据视觉相关性的注意力权重以及上下文感知的连贯性线索动态识别语义关键证据，从而产生更具适应性和校准良好的证据权重分布。然后，它通过联合考虑不确定性意识（令牌熵）和证据校准（权重密度）来确定语义关键令牌的比例，从而实现自适应词汇划分以避免不相关的令牌。实证结果证实，AGMark 优于传统方法，显着提高了生成质量，并在生成后期阶段的视觉语义保真度方面取得了特别强劲的成果。该框架在不牺牲推理效率的情况下，保持了极具竞争力的检测精度（至少99.36％AUC）和强大的攻击弹性（至少88.61％AUC），有效地建立了保留可靠性的多模态水印的新标准。

Title: Blind denoising diffusion models and the blessings of dimensionality

Authors: Zahra Kadkhodaie, Aram-Alexandre Pooladian, Sinho Chewi, Eero Simoncelli
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2602.09639
Pdf URL: https://arxiv.org/pdf/2602.09639
Copy Paste: [[2602.09639]] Blind denoising diffusion models and the blessings of dimensionality(https://arxiv.org/abs/2602.09639)
Keywords: generative
Abstract: We analyze, theoretically and empirically, the performance of generative diffusion models based on blind denoisers, in which the denoiser is not given the noise amplitude in either the training or sampling processes. Assuming that the data distribution has low intrinsic dimensionality, we prove that blind denoising diffusion models (BDDMs), despite not having access to the noise amplitude, automatically track a particular implicit noise schedule along the reverse process. Our analysis shows that BDDMs can accurately sample from the data distribution in polynomially many steps as a function of the intrinsic dimension. Empirical results corroborate these mathematical findings on both synthetic and image data, demonstrating that the noise variance is accurately estimated from the noisy image. Remarkably, we observe that schedule-free BDDMs produce samples of higher quality compared to their non-blind counterparts. We provide evidence that this performance gain arises because BDDMs correct the mismatch between the true residual noise (of the image) and the noise assumed by the schedule used in non-blind diffusion models.
摘要：我们从理论上和经验上分析了基于\emph{盲降噪器}的生成扩散模型的性能，其中降噪器在训练或采样过程中都没有给出噪声幅度。假设数据分布具有较低的内在维度，我们证明盲降噪扩散模型（BDDM）尽管无法获取噪声幅度，但仍会沿着相反的过程自动跟踪特定的隐式噪声时间表。我们的分析表明，BDDM 可以根据内在维度以多项式多个步骤从数据分布中准确采样。经验结果证实了合成数据和图像数据上的这些数学发现，表明噪声方差是根据噪声图像准确估计的。值得注意的是，我们观察到与非盲同行相比，无时间表 BDDM 产生的样本质量更高。我们提供的证据表明，这种性能增益的出现是因为 BDDM 纠正了（图像的）真实残留噪声与非盲扩散模型中使用的时间表假设的噪声之间的不匹配。

Title: TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

Authors: Deyang Jiang, Jing Huang, Xuanle Zhao, Lei Chen, Liming Zheng, Fanfan Liu, Haibo Qiu, Peng Shi, Zhixiong Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09662
Pdf URL: https://arxiv.org/pdf/2602.09662
Copy Paste: [[2602.09662]] TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution(https://arxiv.org/abs/2602.09662)
Keywords: generation
Abstract: Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (i.e., trajectory difficulty) and breadth (i.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at this https URL.
摘要：有效扩展 GUI 自动化对于计算机使用代理 (CUA) 至关重要；然而，现有的工作主要侧重于扩展 GUI 基础，而不是更重要的 GUI 规划，后者需要更复杂的数据收集。实际上，跨应用程序/桌面/网页的 CUA 探索过程通常遵循树形结构，早期的功能入口点通常会被更频繁地探索。因此，将大规模轨迹组织成树结构可以降低数据成本并简化 GUI 规划的数据扩展。在这项工作中，我们提出 TreeCUA 通过树形结构的可验证进化来有效地扩展 GUI 自动化。我们提出了一个多智能体协作框架来探索环境、验证动作、总结轨迹并评估质量，以生成高质量和可扩展的 GUI 轨迹。为了提高效率，我们设计了一种新颖的基于树的拓扑来存储和重放重复的探索节点，并设计了一种自适应探索算法来平衡深度（\emph{即，轨迹难度）和广度（\emph{即}，轨迹多样性）。此外，我们开发了世界知识指导和全局记忆回溯，以避免低质量的生成。最后，我们从丰富的树节点信息中自然地扩展和提出了TreeCUA-DPO方法，通过参考相邻轨迹的分支信息来提高GUI规划能力。实验结果表明，TreeCUA 和 TreeCUA-DPO 提供了显着的改进，域外（OOD）研究进一步证明了强大的泛化能力。所有轨迹节点信息和代码都可以在此 https URL 中获得。

Title: Resilient Class-Incremental Learning: on the Interplay of Drifting, Unlabelled and Imbalanced Data Streams

Authors: Jin Li, Kleanthis Malialis, Marios Polycarpou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09681
Pdf URL: https://arxiv.org/pdf/2602.09681
Copy Paste: [[2602.09681]] Resilient Class-Incremental Learning: on the Interplay of Drifting, Unlabelled and Imbalanced Data Streams(https://arxiv.org/abs/2602.09681)
Keywords: generation
Abstract: In today's connected world, the generation of massive streaming data across diverse domains has become commonplace. Concept drift, class imbalance, label scarcity, and new class emergence jointly degrade representation stability, bias learning toward outdated distributions, and reduce the resilience and reliability of detection in dynamic environments. This paper proposes SCIL (Streaming Class-Incremental Learning) to address these challenges. The SCIL framework integrates an autoencoder (AE) with a multi-layer perceptron for multi-class prediction, uses a dual-loss strategy (classification and reconstruction) for prediction and new class detection, employs corrected pseudo-labels for online training, manages classes with queues, and applies oversampling to handle imbalance. The rationale behind the method's structure is elucidated through ablation studies, and a comprehensive experimental evaluation is performed using both real-world and synthetic datasets that feature class imbalance, incremental classes, and concept drifts. Our results demonstrate that SCIL outperforms strong baselines and state-of-the-art methods. Based on our commitment to Open Science, we make our code and datasets available to the community.
摘要：在当今的互联世界中，跨不同领域生成海量流数据已变得司空见惯。在存在概念漂移、类别不平衡、标签稀缺和新类别出现的情况下，它们共同降低了表示稳定性，使学习偏向过时的分布，并降低了动态环境中检测的弹性和可靠性。本文提出 SCIL（流式课堂增量学习）来应对这些挑战。 SCIL 框架将自动编码器 (AE) 与多层感知器集成以进行多类预测，使用双损失策略（分类和重构）进行预测和新类检测，采用校正的伪标签进行在线训练，使用队列管理类，并应用过采样来处理不平衡。通过消融研究阐明了该方法结构背后的基本原理，并使用具有类不平衡、增量类和概念漂移特征的现实世界和合成数据集进行了全面的实验评估。我们的结果表明 SCIL 优于强大的基线和最先进的方法。基于我们对开放科学的承诺，我们向社区提供我们的代码和数据集。

Title: Physics-informed diffusion models in spectral space

Authors: Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen
Subjects: cs.LG, cs.AI, cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2602.09708
Pdf URL: https://arxiv.org/pdf/2602.09708
Copy Paste: [[2602.09708]] Physics-informed diffusion models in spectral space(https://arxiv.org/abs/2602.09708)
Keywords: generative
Abstract: We propose a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier-Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at this https URL.
摘要：我们提出了一种将生成潜在扩散模型与基于物理的机器学习相结合的方法，以生成以部分观测为条件的参数偏微分方程（PDE）的解，其中特别包括正向和逆向 PDE 问题。我们通过缩放谱表示的潜在空间中的扩散过程来学习偏微分方程参数和解的联合分布，其中高斯噪声对应于具有受控规律性的函数。与基于网格的扩散模型相比，这种谱公式可以显着降低维数，并确保函数空间中的诱导过程保持在一类已明确定义 PDE 算子的函数内。在扩散后验采样的基础上，我们在推理过程中强制执行基于物理的约束和测量条件，在每个扩散步骤中应用基于 Adam 的更新。我们在泊松、亥姆霍兹和不可压缩纳维-斯托克斯方程上评估了所提出的方法，证明与现有的基于扩散的 PDE 求解器相比，该方法具有更高的精度和计算效率，这些求解器是稀疏观测的最新技术。代码可从此 https URL 获取。

Title: Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

Authors: Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09713
Pdf URL: https://arxiv.org/pdf/2602.09713
Copy Paste: [[2602.09713]] Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models(https://arxiv.org/abs/2602.09713)
Keywords: generation
Abstract: Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, where we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, in which the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
摘要：Rigged 3D 资源是 3D 变形和动画的基础。然而，现有的 3D 生成方法在生成可动画几何体方面面临挑战，而绑定技术缺乏对骨架创建的细粒度结构控制。为了解决这些限制，我们引入了 Stroke3D，这是一种新颖的框架，可以根据用户输入直接生成装配网格：2D 绘制的笔划和描述性文本提示。我们的方法开创了一个两阶段管道，将生成分为：1）可控骨架生成，我们采用骨架图 VAE（Sk-VAE）将骨架的图结构编码到潜在空间中，其中骨架图 DiT（Sk-DiT）生成骨架嵌入。生成过程以语义文本和显式结构控制的 2D 笔画为条件，VAE 的解码器重建最终的高质量 3D 骨架； 2) 通过 TextuRig 和 SKA-DPO 增强网格合成，然后我们根据生成的骨架合成纹理网格。在这个阶段，我们首先通过使用 TextuRig 增强现有的骨架到网格模型的训练数据：TextuRig 是一个带有标题的纹理和装配网格数据集，由 Objaverse-XL 策划。此外，我们采用偏好优化策略 SKA-DPO，以骨架网格对齐分数为指导，以进一步提高几何保真度。我们的框架共同实现了更直观的工作流程，用于创建可立即制作动画的 3D 内容。据我们所知，我们的工作是第一个根据用户绘制的 2D 笔画生成装配 3D 网格的工作。大量实验证明 Stroke3D 可以生成合理的骨架和高质量的网格。

Title: Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

Authors: Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith
Subjects: cs.CV, cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2602.09730
Pdf URL: https://arxiv.org/pdf/2602.09730
Copy Paste: [[2602.09730]] Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings(https://arxiv.org/abs/2602.09730)
Keywords: restoration, generative
Abstract: Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford-Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.
摘要：成像技术、深度学习和数值性能的最新进展使得对艺术品进行非侵入式详细分析成为可能，支持其记录和保护。特别是，数字化绘画中裂纹的自动检测对于评估退化和指导修复至关重要，但由于可能复杂的场景以及裂纹和类似裂纹的艺术特征（例如笔触或头发）之间的视觉相似性，仍然具有挑战性。我们提出了一种混合方法，将裂纹检测建模为逆问题，将观察到的图像分解为无裂纹的绘画和裂纹组件。采用深度生成模型作为底层艺术品的强大先验，而使用 Mumford-Shah 型变分函数和裂纹先验来捕获裂纹结构。联合优化产生了绘画中裂纹定位的像素级图。

Title: Toward Fine-Grained Facial Control in 3D Talking Head Generation

Authors: Shaoyang Xie, Xiaofeng Cong, Baosheng Yu, Zhipeng Gui, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09736
Pdf URL: https://arxiv.org/pdf/2602.09736
Copy Paste: [[2602.09736]] Toward Fine-Grained Facial Control in 3D Talking Head Generation(https://arxiv.org/abs/2602.09736)
Keywords: generation
Abstract: Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
摘要：音频驱动的头像生成是数字化身的核心组成部分，3D Gaussian Splatting 在高保真头像实时渲染方面表现出了强大的性能。然而，实现对细粒度面部运动的精确控制仍然是一个重大挑战，特别是由于口型同步不准确和面部抖动，这两者都可能导致恐怖谷效应。为了应对这些挑战，我们提出了细粒度 3D 高斯分布 (FG-3DGS)，这是一种新颖的框架，可以实现时间一致和高保真头部说话。我们的方法引入了频率感知的解缠结策略，以根据运动特征显式地建模面部区域。低频区域（例如脸颊、鼻子和前额）使用标准 MLP 联合建模，而高频区域（包括眼睛和嘴巴）则使用由面部区域掩模引导的专用网络单独捕获。预测的运动动态（表示为高斯增量）应用于静态高斯以生成最终的头部帧，该头部帧通过光栅化器使用特定于帧的相机参数进行渲染。此外，还采用了通过预训练模型从大规模音频-视频对中学习的高频细化后渲染对齐机制，以增强每帧生成并实现更准确的唇形同步。对广泛使用的头像生成数据集进行的大量实验表明，我们的方法在生成高保真、口型同步的头像视频方面优于最新的最先进方法。

Title: Towards Poisoning Robustness Certification for Natural Language Generation

Authors: Mihnea Ghitu, Matthew Wicker
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09757
Pdf URL: https://arxiv.org/pdf/2602.09757
Copy Paste: [[2602.09757]] Towards Poisoning Robustness Certification for Natural Language Generation(https://arxiv.org/abs/2602.09757)
Keywords: generation
Abstract: Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity/targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA's effectiveness across diverse settings including: certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.
摘要：了解自然语言生成的可靠性对于在安全敏感领域部署基础模型至关重要。虽然经过认证的中毒防御为分类任务提供了可证明的稳健性界限，但它们从根本上不适合自回归生成：它们无法处理顺序预测或语言模型的指数级大输出空间。为了建立经过认证的自然语言生成框架，我们形式化了两个安全属性：稳定性（对生成中任何变化的鲁棒性）和有效性（对生成中的有针对性的有害变化的鲁棒性）。我们引入了目标分区聚合（TPA），这是第一个通过计算诱导特定有害类别、标记或短语所需的最小中毒预算来证明有效性/有针对性的攻击的算法。此外，我们扩展了 TPA，以便使用混合整数线性规划 (MILP) 为多轮生成提供更严格的保证。根据经验，我们证明了 TPA 在不同设置中的有效性，包括：当对手修改最多 0.5% 的数据集时，验证代理工具调用的有效性，并在基于偏好的对齐中验证 8 个令牌的稳定性范围。尽管推理时间延迟仍然是一个开放的挑战，但我们的贡献使得能够在安全关键型应用程序中对语言模型进行认证部署。

Title: Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets

Authors: Abhipsa Basu, Yugam Bahl, Kirti Bhagat, Preethi Seshadri, R. Venkatesh Babu, Danish Pruthi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09775
Pdf URL: https://arxiv.org/pdf/2602.09775
Copy Paste: [[2602.09775]] Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets(https://arxiv.org/abs/2602.09775)
Keywords: generation
Abstract: Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across 20 common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries are severely under-represented with only 1.8% and 3.8% of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data (ρ = 0.82). Examining non-English subsets for 4 languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
摘要：最近的研究表明，文本到图像模型通常无法生成具有地理代表性的图像，这引起了人们对其训练数据代表性的担忧，并引发了一个问题：这些训练示例来自世界的哪些地区？我们根据使用法学硕士从字幕中提取的位置信息，将图像字幕对映射到国家/地区，从而对大规模多模式数据集进行地理分析。研究三个广泛使用的数据集（Re-LAION、DataComp1B 和 Conceptual Captions）中涉及 20 美元常见实体（例如房屋、旗帜）的英文字幕，我们发现美国、英国和加拿大占样本的 48.0\%$，而南美和非洲国家的代表性严重不足，分别只有 $1.8\%$ 和 $3.8\%$ 的图像。我们观察到一个国家的 GDP 与其在数据中的表示形式之间存在很强的相关性 ($\rho = 0.82$)。检查 Re-LAION 数据集中 4 美元语言的非英语子集，我们发现代表性严重偏向主要使用这些语言的国家。此外，我们发现更高的表示并不一定意味着更大的视觉或语义多样性。最后，通过分析在 Re-LAION 上训练的 Stable Diffusion v1.3 生成的特定国家图像，我们发现虽然各代图像看起来很真实，但与真实世界图像相比，它们的覆盖范围受到严重限制。
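
A correlation such as the reported ρ between GDP and dataset share can be computed as in the sketch below. It assumes ρ denotes Spearman's rank correlation, which the abstract does not specify. The helper names and the toy numbers are made up for illustration and are not data from the paper.

```python
import math

def average_ranks(xs):
    """Assign 1-based ranks to xs, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy values only: hypothetical GDP figures and hypothetical sample shares.
toy_gdp = [1.0, 2.0, 3.0, 4.0]
toy_share = [10.0, 20.0, 15.0, 40.0]
rho = spearman(toy_gdp, toy_share)
```

Rank correlation is a natural choice here because GDP and sample counts are both heavily skewed, so a few very large countries would otherwise dominate a linear correlation.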

Title: Explainability in Generative Medical Diffusion Models: A Faithfulness-Based Analysis on MRI Synthesis

Authors: Surjo Dey, Pallabi Saikia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09781
Pdf URL: https://arxiv.org/pdf/2602.09781
Copy Paste: [[2602.09781]] Explainability in Generative Medical Diffusion Models: A Faithfulness-Based Analysis on MRI Synthesis(https://arxiv.org/abs/2602.09781)
Keywords: generative
Abstract: This study investigates the explainability of generative diffusion models in the context of medical imaging, focusing on magnetic resonance imaging (MRI) synthesis. Although diffusion models have shown strong performance in generating realistic medical images, their internal decision-making process remains largely opaque. We present a faithfulness-based explainability framework that analyzes how prototype-based explainability methods like ProtoPNet (PPNet), Enhanced ProtoPNet (EPPNet), and ProtoPool can link generated features to training features. Our study focuses on understanding the reasoning behind image formation through the denoising trajectory of the diffusion model and, subsequently, prototype explainability with faithfulness analysis. Experimental analysis shows that EPPNet achieves the highest faithfulness (with score 0.1534), offering more reliable insights into, and explainability of, the generative process. The results highlight that diffusion models can be made more transparent and trustworthy through faithfulness-based explanations, contributing to safer and more interpretable applications of generative AI in healthcare.
摘要：本研究研究了医学成像背景下生成扩散模型的可解释性，重点是磁共振成像（MRI）合成。尽管扩散模型在生成逼真的医学图像方面表现出了强大的性能，但其内部决策过程在很大程度上仍然是不透明的。我们提出了一个基于忠实度的可解释性框架，该框架分析了 ProtoPNet (PPNet)、增强型 ProtoPNet (EPPNet) 和 ProtoPool 等基于原型的可解释性方法如何链接生成特征和训练特征之间的关系。我们的研究重点是通过扩散模型的去噪轨迹来理解图像形成背后的推理，并随后通过忠实度分析来了解原型的可解释性。实验分析表明，EPPNet 实现了最高的忠实度（得分为 0.1534），为生成过程提供了更可靠的见解和可解释性。结果表明，通过基于忠实性的解释，扩散模型可以变得更加透明和可信，从而有助于生成式人工智能在医疗保健领域的更安全、更可解释的应用。

Title: When Less is More: The LLM Scaling Paradox in Context Compression

Authors: Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09789
Pdf URL: https://arxiv.org/pdf/2602.09789
Copy Paste: [[2602.09789]] When Less is More: The LLM Scaling Paradox in Context Compression(https://arxiv.org/abs/2602.09789)
Keywords: generation, generative
Abstract: Scaling up model parameters has long been a prevalent training paradigm driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we observe a Size-Fidelity Paradox: increasing the compressor size can lessen the faithfulness of reconstructed contexts even though training loss decreases. Through extensive experiments across models from 0.6B to 90B, we attribute this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., "the white strawberry" → "the red strawberry"; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., "Alice hit Bob" → "Bob hit Alice". By holding model size fixed, we reflect on the emergent properties of compressed context representations. We show that the culprit is not parameter count itself, but the excessive semantic capacity and amplified generative uncertainty that accompany scaling. Specifically, the increased rank of context embeddings facilitates prior knowledge intrusion, whereas higher entropy over token prediction distributions promotes rewriting. Our results complement existing evaluations of the context compression paradigm, underpinning a breakdown in scaling laws for faithful preservation in open-ended generation.
摘要：长期以来，扩大模型参数一直是一种流行的训练范例，其驱动因素是较大的模型会产生卓越的生成能力。然而，在压缩器-解码器设置中的有损上下文压缩下，我们观察到大小保真度悖论：尽管训练损失减少，但增加压缩器大小可能会降低重建上下文的忠实度。通过对 0.6B 到 90B 模型的广泛实验，我们发现这个悖论源于两个主要因素：1）知识覆盖：较大的模型越来越多地用自己先验的信念取代源事实，例如“白草莓”$\to$“红草莓”； 2）语义漂移：较大的模型倾向于解释或重组内容，而不是逐字复制内容，例如“爱丽丝打鲍勃”$\to$“鲍勃打爱丽丝”。通过保持模型大小固定，我们反思了压缩上下文表示的涌现属性。我们表明，罪魁祸首不是参数计数本身，而是伴随缩放而来的过多的语义容量和放大的生成不确定性。具体来说，上下文嵌入的排名增加有利于先验知识入侵，而令牌预测分布的更高熵则促进重写。我们的结果补充了对上下文压缩范式的现有评估，支持了开放式生成中忠实保存的缩放法则的崩溃。

Title: Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson's disease and isolated REM sleep behavior disorder

Authors: Jesper Strøm, Casper Skjærbæk, Natasha Becker Bertelsen, Steffen Torpe Simonsen, Niels Okkels, David Bertram, Sinah Röttgen, Konstantin Kufer, Kaare B. Mikkelsen, Marit Otto, Poul Jørgen Jennum, Per Borghammer, Michael Sommerauer, Preben Kidmose
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2602.09793
Pdf URL: https://arxiv.org/pdf/2602.09793
Copy Paste: [[2602.09793]] Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson's disease and isolated REM sleep behavior disorder(https://arxiv.org/abs/2602.09793)
Keywords: generative
Abstract: Isolated REM sleep behavior disorder (iRBD) is a key prodromal marker of Parkinson's disease (PD), and video-polysomnography (vPSG) remains the diagnostic gold standard. However, manual sleep staging is particularly challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, making PSG assessments a bottleneck for deploying new RBD screening technologies at scale. We adapted U-Sleep, a deep neural network, for generalizable sleep staging in PD and iRBD. A pretrained U-Sleep model, based on a large publicly available, multisite non-neurodegenerative dataset (PUB; 19,236 PSGs across 12 sites), was fine-tuned on research datasets from two centers (Lundbeck Foundation Parkinson's Disease Research Center (PACE) and the Cologne-Bonn Cohort (CBC); 112 PD, 138 iRBD, 89 age-matched controls). The resulting model was evaluated on an independent dataset from the Danish Center for Sleep Medicine (DCSM; 81 PD, 36 iRBD, 87 sleep-clinic controls). A subset of PSGs with low agreement between the human rater and the model (κ < 0.6) was re-scored by a second blinded human rater to identify sources of disagreement. Finally, we applied confidence-based thresholds to optimize REM sleep staging. The pretrained model achieved mean κ = 0.81 in PUB, but κ = 0.66 when applied directly to PACE/CBC. By fine-tuning the model, we developed a generalized model with κ = 0.74 on PACE/CBC (p < 0.001 vs. the pretrained model). In DCSM, mean and median κ increased from 0.60 to 0.64 (p < 0.001) and 0.64 to 0.69 (p < 0.001), respectively. In the interrater study, PSGs with low agreement between the model and the initial scorer showed similarly low agreement between human scorers. Applying a confidence threshold increased the proportion of correctly identified REM sleep epochs from 85% to 95.5%, while preserving sufficient (> 5 min) REM sleep for 95% of subjects.
摘要：孤立性快速眼动睡眠行为障碍 (iRBD) 是帕金森病 (PD) 的关键前驱标志物，而视频多导睡眠图 (vPSG) 仍然是诊断的金标准。然而，由于脑电图异常和睡眠碎片化，手动睡眠分期在神经退行性疾病中尤其具有挑战性，这使得 PSG 评估成为大规模部署新 RBD 筛查技术的瓶颈。我们采用深度神经网络 U-Sleep 来进行 PD 和 iRBD 的通用睡眠分期。基于大型公开多站点非神经退行性数据集（PUB；跨 12 个站点的 19,236 个 PSG）的预训练 U-Sleep 模型根据两个中心（灵北基金会帕金森病研究中心 (PACE) 和科隆-波恩队列 (CBC)；112 个 PD、138 个 iRBD、89 个年龄匹配对照）的研究数据集进行了微调。该模型在丹麦睡眠医学中心的独立数据集（DCSM；81 个 PD、36 个 iRBD、87 个睡眠诊所对照）上进行了评估，由第二个盲人评估者对人类评估者与模型之间一致性较低的 PSG 子集进行了重新评分，以识别不一致的来源。最后，我们应用基于置信度的阈值来优化 REM 睡眠分期。 PUB 中的 \k{appa} = 0.81，但直接应用于 PACE/CBC 时 \k{appa} = 0.66 通过微调模型，我们开发了 PACE/CBC 上 \k{appa} = 0.74 的广义模型（与预训练模型相比，p < 0.001）。分别为 0.001）和 0.64 至 0.69（p < 0.001），在模型与初始评分者之间的一致性较低的 PSG 中，应用置信度阈值将正确识别的 REM 睡眠时期的比例从 85% 增加到 95.5%，同时为 95% 的受试者保留了充足的（> 5 分钟）REM 睡眠。
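Note: the agreement statistic κ reported in this abstract is presumably Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The following Python sketch shows that formula together with a simple confidence filter on REM predictions. The function names, the stage labels, and the (label, probability) epoch format are assumptions made for illustration; they are not taken from the paper or from U-Sleep.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(rater_a)
    # Observed agreement: fraction of epochs with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def confident_rem(epochs, threshold):
    """Indices of epochs predicted as REM with probability >= threshold.

    `epochs` is a list of (predicted_label, probability) pairs (hypothetical format).
    """
    return [i for i, (label, prob) in enumerate(epochs)
            if label == "REM" and prob >= threshold]


human = ["W", "N2", "REM", "REM", "N2", "W"]
model = ["W", "N2", "REM", "N2", "N2", "W"]
print(cohen_kappa(human, model))
print(confident_rem([("REM", 0.97), ("REM", 0.55), ("N2", 0.99)], 0.9))
```

Raising the threshold trades the number of retained REM epochs against their reliability, which is the trade-off the abstract describes.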

Title: SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Authors: Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09809
Pdf URL: https://arxiv.org/pdf/2602.09809
Copy Paste: [[2602.09809]] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing(https://arxiv.org/abs/2602.09809)
Keywords: generation
Abstract: Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
摘要：科学图表传达了明确的结构信息，但现代文本到图像模型通常会产生视觉上合理但结构上不正确的结果。现有的基准要么依赖于以图像为中心的或对结构不敏感的主观指标，要么评估中间符号表示而不是最终渲染的图像，从而导致基于像素的图表生成尚未得到充分探索。我们引入了 SciFlow-Bench，这是一种结构优先的基准，用于直接从像素级输出评估科学图表的生成。 SciFlow-Bench 以真正的科学 PDF 为基础，将每个源框架图与规范的地面实况图配对，并在闭环、往返协议下将模型评估为黑盒图像生成器，该协议将生成的图表图像反向解析回结构化图以进行比较。该设计通过结构可恢复性而不是仅通过视觉相似性来强制进行评估，并通过协调规划、感知和结构推理的分层多智能体系统来实现。实验表明，保持结构正确性仍然是一个基本挑战，特别是对于具有复杂拓扑的图，这强调了结构感知评估的必要性。

Title: SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding

Authors: Zhaoxu Li, Chenqi Kong, Peijun Bao, Song Xia, Yi Tu, Yi Yu, Xinghao Jiang, Xudong Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09825
Pdf URL: https://arxiv.org/pdf/2602.09825
Copy Paste: [[2602.09825]] SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding(https://arxiv.org/abs/2602.09825)
Keywords: generation
Abstract: Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
摘要：大视觉语言模型 (LVLM) 中的幻觉在现实应用中带来了重大的安全和可靠性风险。受人类在不确定或犹豫时更容易出错这一观察的启发，我们研究了模型内部知识的不稳定性如何导致 LVLM 幻觉。我们从三个角度（即注意力头、模型层和解码令牌）进行了广泛的实证分析，并确定了三种关键的幻觉模式：（i）注意力头之间的视觉激活漂移，（ii）跨层的明显知识波动，以及（iii）相邻输出令牌之间的视觉焦点分散。基于这些发现，我们提出了稳定性感知知识增强解码（SAKED），它引入了分层知识稳定性评分（KSS）来量化整个模型的知识稳定性。通过对比最稳定的感知层和与稳定性无关的层，SAKED 抑制解码噪声并动态利用最可靠的内部知识来忠实地生成令牌。此外，SAKED无需培训，可以无缝集成到不同的架构中。大量实验表明，SAKED 在各种模型、任务和基准测试中实现了最先进的幻觉缓解性能。

Title: Kelix Technique Report

Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09843
Pdf URL: https://arxiv.org/pdf/2602.09843
Copy Paste: [[2602.09843]] Kelix Technique Report(https://arxiv.org/abs/2602.09843)
Keywords: generation
Abstract: Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
摘要：自回归大语言模型 (LLM) 通过将不同的任务表示为离散的自然语言标记序列并通过下一个标记预测进行训练，从而在自我监督下统一理解和生成，从而可以很好地扩展。将这种范式扩展到多模态数据需要跨模态的共享、离散表示。然而，大多数视觉语言模型 (VLM) 仍然依赖于混合接口：离散文本标记与连续视觉变换器 (ViT) 功能配对。由于监督很大程度上是文本驱动的，这些模型往往偏向于理解，无法充分利用对非文本数据的大规模自监督学习。最近的工作探索了离散视觉标记化，以实现完全自回归多模态建模，显示出在统一理解和生成方面取得的有希望的进展。然而，由于代码容量有限，现有的离散视觉令牌经常丢失信息，导致理解能力明显弱于连续特征 VLM。我们提出 Kelix，一个完全离散的自回归统一模型，它缩小了离散和连续视觉表示之间的理解差距。

Title: CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization

Authors: Beicheng Xu, Keyao Ding, Wei Liu, Yupeng Lu, Bin Cui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.09851
Pdf URL: https://arxiv.org/pdf/2602.09851
Copy Paste: [[2602.09851]] CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization(https://arxiv.org/abs/2602.09851)
Keywords: generation
Abstract: Feature Engineering (FE) is pivotal in automated machine learning (AutoML) but remains a bottleneck for traditional methods, which treat it as a black-box search, operating within rigid, predefined search spaces and lacking domain awareness. While Large Language Models (LLMs) offer a promising alternative by leveraging semantic reasoning to generate unbounded operators, existing methods fail to construct free-form FE pipelines, remaining confined to isolated subtasks such as feature generation. Most importantly, they are rarely optimized jointly with hyperparameter optimization (HPO) of the ML model, leading to greedy "FE-then-HPO" workflows that cannot capture strong FE-HPO interactions. In this paper, we present CoFEH, a collaborative framework that interleaves LLM-based FE and Bayesian HPO for robust end-to-end AutoML. CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (ToT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that realizes interleaved optimization by adaptively scheduling FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism that shares context between LLM and BO, enabling mutually informed decisions. Experiments show that CoFEH not only outperforms traditional and LLM-based FE baselines, but also achieves superior end-to-end performance under joint optimization.
摘要：特征工程 (FE) 在自动化机器学习 (AutoML) 中至关重要，但仍然是传统方法的瓶颈，传统方法将其视为黑盒搜索，在严格的预定义搜索空间内运行且缺乏领域感知。虽然大型语言模型 (LLM) 通过利用语义推理来生成无界运算符，提供了一种有前途的替代方案，但现有方法无法构建自由形式的有限元管道，仍然局限于孤立的子任务，例如特征生成。最重要的是，它们很少与 ML 模型的超参数优化 (HPO) 联合进行优化，从而导致贪婪的“FE-then-HPO”工作流程，无法捕获强大的 FE-HPO 交互。在本文中，我们提出了 CoFEH，这是一个协作框架，它将基于 LLM 的 FE 和贝叶斯 HPO 交织在一起，以实现强大的端到端 AutoML。 CoFEH 使用由思想树 (ToT) 支持的 LLM 驱动的 FE 优化器来探索灵活的 FE 管道，使用贝叶斯优化 (BO) 模块来解决 HPO，以及动态优化器选择器，通过自适应调度 FE 和 HPO 步骤来实现交错优化。至关重要的是，我们引入了一种相互调节机制，可以在 LLM 和 BO 之间共享背景，从而实现相互知情的决策。实验表明，CoFEH 不仅优于传统和基于 LLM 的 FE 基线，而且在联合优化下实现了卓越的端到端性能。

Title: Code2World: A GUI World Model via Renderable Code Generation

Authors: Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin
Subjects: cs.CV, cs.AI, cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2602.09856
Pdf URL: https://arxiv.org/pdf/2602.09856
Copy Paste: [[2602.09856]] Code2World: A GUI World Model via Renderable Code Generation(https://arxiv.org/abs/2602.09856)
Keywords: generation
Abstract: Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
摘要：自主 GUI 代理通过感知界面并执行操作与环境进行交互。作为一个虚拟沙箱，GUI World 模型通过启用动作条件预测，使代理具有类似人类的远见。然而，现有的基于文本和像素的方法很难同时实现高视觉保真度和细粒度的结构可控性。为此，我们提出了 Code2World，一种视觉语言编码器，可通过可渲染代码生成来模拟下一个视觉状态。具体来说，为了解决数据稀缺问题，我们通过将 GUI 轨迹转换为高保真 HTML 并通过视觉反馈修订机制完善合成代码来构建 AndroidCode，从而生成超过 80K 高质量屏幕操作对的语料库。为了使现有的 VLM 适应代码预测，我们首先执行 SFT 作为格式布局遵循的冷启动，然后进一步应用渲染感知强化学习，通过强制视觉语义保真度和动作一致性，使用渲染结果作为奖励信号。大量实验表明，Code2World-8B 实现了性能最佳的下一个 UI 预测，可与竞争性的 GPT-5 和 Gemini-3-Pro-Image 相媲美。值得注意的是，Code2World 以灵活的方式显着提高了下游导航的成功率，使 Gemini-2.5-Flash 在 AndroidWorld 导航上提高了 9.5%。该代码可从此 https URL 获取。

Title: Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence

Authors: Xiaoyue Ling, Chuqin Zhou, Chunyi Li, Yunuo Chen, Yuan Tian, Guo Lu, Wenjun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09868
Pdf URL: https://arxiv.org/pdf/2602.09868
Copy Paste: [[2602.09868]] Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence(https://arxiv.org/abs/2602.09868)
Keywords: generation, generative
Abstract: Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
摘要：基于视频生成领域的最新进展，生成视频压缩已成为实现视觉上令人愉悦的重建的新范例。然而，现有方法对时间相关性的利用有限，导致在超低比特率下出现明显的闪烁和时间相干性下降。在本文中，我们提出了 Free-GVC，这是一种免训练的生成视频压缩框架，它将视频编码重新表述为由视频扩散先验引导的潜在轨迹压缩。我们的方法在图片组（GOP）级别上运行，将视频片段编码到紧凑的潜在空间中，并沿着扩散轨迹逐步压缩它们。为了确保跨 GOP 的感知一致重建，我们引入了自适应质量控制模块，该模块动态构建在线速率感知代理模型来预测每个 GOP 的最佳扩散步骤。此外，GOP 间对齐模块可建立帧重叠并在相邻组之间执行潜在融合，从而减轻闪烁并增强时间一致性。实验表明，与最新的神经编解码器 DCVC-RT 相比，Free-GVC 在 DISTS 中实现了平均 93.29% 的 BD-Rate 降低，并且用户研究进一步证实了其在超低比特率下的卓越感知质量和时间一致性。

Title: MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Authors: Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09878
Pdf URL: https://arxiv.org/pdf/2602.09878
Copy Paste: [[2602.09878]] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation(https://arxiv.org/abs/2602.09878)
Keywords: generation, generative
Abstract: World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
摘要：基于世界模型的“想象然后行动”成为机器人操纵的一个有前景的范例，但现有方法通常支持纯粹基于图像的预测或对部分 3D 几何图形的推理，限制了它们预测完整 4D 场景动态的能力。这项工作提出了一种新颖的具体化 4D 世界模型，可实现几何一致的任意视图 RGBD 生成：仅将单视图 RGBD 观察作为输入，该模型会想象剩余的视点，然后可以对这些视点进行反向投影和融合，以跨时间组装更完整的 3D 结构。为了有效地学习多视图、跨模态生成，我们明确设计了跨视图和跨模态特征融合，共同促进 RGB 和深度之间的一致性，并强制跨视图的几何对齐。除了预测之外，将生成的未来转换为行动通常是通过逆动态来处理的，这是不适定的，因为多个行动可以解释相同的转变。我们通过测试时动作优化策略来解决这个问题，该策略通过生成模型进行反向传播，以推断出与预测的未来最匹配的轨迹级潜在特征，以及残差逆动力学模型，将该轨迹先验转化为准确的可执行动作。对三个数据集的实验证明了 4D 场景生成和下游操作的强大性能，并且消融为关键设计选择提供了实用的见解。

Title: AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization

Authors: Shaoqiu Zhang, Zizhong Ding, Kaicheng Yang, Junyi Wu, Xianglong Yan, Xi Li, Bingnan Duan, Jianping Fang, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.09883
Pdf URL: https://arxiv.org/pdf/2602.09883
Copy Paste: [[2602.09883]] AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization(https://arxiv.org/abs/2602.09883)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at this https URL.
摘要：扩散变压器 (DiT) 已成为高保真图像和视频生成的最先进的支柱。然而，它们巨大的计算成本和内存占用阻碍了在边缘设备上的部署。虽然训练后量化 (PTQ) 已被证明对大型语言模型 (LLM) 有效，但由于忽略了扩散过程中固有的独特时间动态，直接将现有方法应用于 DiT 会产生次优结果。在本文中，我们提出了 AdaTSQ，这是一种新颖的 PTQ 框架，它通过利用 DiT 的时间敏感性来推动效率和质量的帕累托前沿。首先，我们提出了帕累托感知时间步动态位宽分配策略。我们将量化策略搜索建模为受限寻路问题。我们利用由端到端重建误差引导的波束搜索算法来跨不同时间步动态分配分层位宽。其次，我们提出了费舍尔引导的时间校准机制。它利用时态 Fisher 信息对来自高度敏感时间步长的校准数据进行优先级排序，与基于 Hessian 的权重优化无缝集成。对四种先进 DiT（例如 Flux-Dev、Flux-Schnell、Z-Image 和 Wan2.1）的大量实验表明，AdaTSQ 的性能显着优于 SVDQuant 和 ViDiT-Q 等最先进的方法。我们的代码将在此 https URL 发布。

Title: Monocular Normal Estimation via Shading Sequence Estimation

Authors: Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09929
Pdf URL: https://arxiv.org/pdf/2602.09929
Copy Paste: [[2602.09929]] Monocular Normal Estimation via Shading Sequence Estimation(https://arxiv.org/abs/2602.09929)
Keywords: generative
Abstract: Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
摘要：单目法线估计旨在从任意光照下物体的单个 RGB 图像估计法线图。现有方法依赖深度模型来直接预测法线贴图。然而，它们经常遭受 3D 未对准的影响：虽然估计的法线贴图可能看起来具有正确的外观，但重建的表面通常无法与几何细节对齐。我们认为这种错位源于当前的范式：模型难以区分和重建法线贴图中表示的不同几何形状，因为底层几何形状的差异仅通过相对微妙的颜色变化反映出来。为了解决这个问题，我们提出了一种新的范式，将法线估计重新表述为着色序列估计，其中着色序列对各种几何信息更加敏感。在此范例的基础上，我们提出了 RoSE，一种利用图像到视频生成模型来预测着色序列的方法。然后通过解决简单的普通最小二乘问题将预测的着色序列转换为法线贴图。为了增强鲁棒性并更好地处理复杂对象，RoSE 在具有不同形状、材料和光照条件的合成数据集 MultiShade 上进行训练。实验表明，RoSE 在基于对象的单目法线估计的真实世界基准数据集上实现了最先进的性能。
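Note: the abstract says predicted shading sequences are converted into normal maps by solving an ordinary least-squares problem. A minimal per-pixel sketch of that kind of solve is given below. It assumes a Lambertian model with known light directions and no shadows, so that shading s_k = l_k · (albedo · n); these assumptions, the function names, and the sample lights are illustrative and are not taken from RoSE.

```python
import math


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def normal_from_shading(lights, shading):
    """Unit normal for one pixel minimising ||L g - s||^2 via the normal equations."""
    # Build (L^T L) and (L^T s); g is the albedo-scaled normal.
    ltl = [[sum(l[i] * l[j] for l in lights) for j in range(3)] for i in range(3)]
    lts = [sum(l[i] * s for l, s in zip(lights, shading)) for i in range(3)]
    g = solve_3x3(ltl, lts)
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


# Hypothetical light directions and a synthetic pixel with albedo 0.5.
lights = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.0, 0.8)]
true_normal = (0.6, 0.0, 0.8)
shading = [0.5 * sum(l[i] * true_normal[i] for i in range(3)) for l in lights]
print(normal_from_shading(lights, shading))
```

The albedo cancels when the solution is normalised, so only the direction of the recovered vector matters here.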

Title: A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

Authors: Chenruo Liu, Yijun Dong, Yiqiu Shen, Qi Lei
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2602.10014
Pdf URL: https://arxiv.org/pdf/2602.10014
Copy Paste: [[2602.10014]] A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula(https://arxiv.org/abs/2602.10014)
Keywords: generative
Abstract: Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated via Monte-Carlo simulations and controlled experiments on graph-based reasoning tasks.
摘要：迭代自我改进根据 LLM 本身生成的奖励验证输出对自回归大语言模型 (LLM) 进行微调。与自我改进的经验成功相比，这种在实际的有限样本环境中生成、迭代过程的理论基础仍然有限。我们通过将每一轮自我改进建模为奖励过滤分布的最大似然微调，并导出预期奖励的有限样本保证，从而朝着这一目标取得进展。我们的分析揭示了一个明确的反馈循环，更好的模型每次迭代接受更多数据，支持持续的自我改进，同时解释这种改进的最终饱和。通过考虑具有多个难度级别的推理任务，采用以任务为中心的观点，我们进一步证明了模型初始化、任务难度和样本预算的可量化条件，其中由易到难的课程可比固定任务混合训练获得更好的保证。我们的分析通过蒙特卡罗模拟和基于图形的推理任务的受控实验进行了验证。

Title: Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection

Authors: Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10042
Pdf URL: https://arxiv.org/pdf/2602.10042
Copy Paste: [[2602.10042]] Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection(https://arxiv.org/abs/2602.10042)
Keywords: generative
Abstract: Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
摘要：最近的研究表明，将思想链 (CoT) 推理纳入检测过程可以增强模型检测合成图像的能力。然而，过长的推理会带来大量的资源开销，包括令牌消耗和延迟，这在处理明显生成的伪造时尤其多余。为了解决这个问题，我们提出了 Fake-HR1，这是一种大规模混合推理模型，据我们所知，它是第一个根据生成检测任务的特征自适应地确定是否需要推理的模型。为了实现这一目标，我们设计了一个两阶段的训练框架：我们首先执行混合微调（HFT）进行冷启动初始化，然后使用混合推理分组策略优化（HGRPO）进行在线强化学习，以隐式学习何时选择合适的推理模式。实验结果表明，Fake-HR1能够自适应地跨不同类型的查询进行推理，在推理能力和生成检测性能上都超越了现有的LLM，同时显着提高了响应效率。

Title: WildCat: Near-Linear Attention in Theory and Practice

Authors: Tobias Schröder, Lester Mackey
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2602.10056
Pdf URL: https://arxiv.org/pdf/2602.10056
Copy Paste: [[2602.10056]] WildCat: Near-Linear Attention in Theory and Practice(https://arxiv.org/abs/2602.10056)
Keywords: generation
Abstract: We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
摘要：我们引入了 WildCat，一种高精度、低成本的方法来压缩神经网络中的注意力机制。虽然注意力是现代网络架构的主要内容，但由于资源需求随输入序列长度 $n$ 呈二次方扩展，因此部署起来也非常昂贵。 WildCat 通过只参与一个小的加权核心集来避免这些二次成本。至关重要的是，我们使用快速但光谱准确的子采样算法（随机旋转 Cholesky）来选择核心集，并对元素进行最佳加权以最小化重建误差。值得注意的是，给定有界输入，WildCat 在近线性 $O(n^{1+o(1)})$ 时间内运行时，以超多项式 $O(n^{-\sqrt{\log(\log(n))}})$ 误差衰减近似精确注意力。相比之下，先前的实际近似要么缺乏误差保证，要么需要二次运行时间来保证如此高的保真度。我们将这一进步与 GPU 优化的 PyTorch 实现和一系列基准实验结合起来，展示了 WildCat 在图像生成、图像分类和语言模型 KV 缓存压缩方面的优势。
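Note: the abstract names randomly pivoted Cholesky as the subsampling routine used to pick the coreset. The sketch below shows the textbook form of that routine on a small positive semidefinite kernel matrix. It covers pivot selection and the low-rank factor only; WildCat's optimal weighting and its attention approximation are not reproduced, and the Gaussian kernel, point set, and function name are assumptions for illustration.

```python
import math
import random


def rp_cholesky(A, k, rng):
    """Rank-k factor F with A ~= F F^T via randomly pivoted Cholesky.

    A is a symmetric PSD matrix (list of lists). Returns (F, pivots).
    """
    n = len(A)
    F = [[0.0] * k for _ in range(n)]
    d = [A[i][i] for i in range(n)]  # diagonal of the current residual
    pivots = []
    for t in range(k):
        if sum(d) <= 1e-12:  # residual is numerically zero; stop early
            break
        # Sample a pivot with probability proportional to the residual diagonal.
        s = rng.choices(range(n), weights=d)[0]
        pivots.append(s)
        # Column s of the residual matrix A - F F^T.
        g = [A[i][s] - sum(F[i][j] * F[s][j] for j in range(t)) for i in range(n)]
        scale = math.sqrt(g[s])
        for i in range(n):
            F[i][t] = g[i] / scale
            d[i] = max(d[i] - F[i][t] ** 2, 0.0)
        d[s] = 0.0  # a chosen pivot is never selected again
    return F, pivots


points = [0.0, 1.0, 2.0, 3.0, 4.0]
K = [[math.exp(-0.5 * (x - y) ** 2) for y in points] for x in points]
F, pivots = rp_cholesky(K, 2, random.Random(0))
print(pivots)
```

The selected pivots play the role of the coreset indices; with k equal to the matrix size the factorisation reproduces the matrix up to rounding.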

Title: Causality in Video Diffusers is Separable from Denoising

Authors: Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, Zongze Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10095
Pdf URL: https://arxiv.org/pdf/2602.10095
Copy Paste: [[2602.10095]] Causality in Video Diffusers is Separable from Denoising(https://arxiv.org/abs/2602.10095)
Keywords: generation, generative
Abstract: Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
摘要：因果关系——指的是组件之间的时间性、单向因果关系——是许多复杂生成过程的基础，包括视频、语言和机器人轨迹。当前的因果扩散模型将时间推理与迭代去噪结合起来，在所有层、每个去噪步骤以及整个上下文中应用因果注意力。在本文中，我们证明这些模型中的因果推理与多步骤去噪过程是可分离的。通过对自回归视频扩散器的系统探测，我们发现了两个关键规律：（1）早期层在去噪步骤中产生高度相似的特征，表明沿扩散轨迹的冗余计算；（2）更深的层表现出稀疏的跨帧注意力，并且主要执行帧内渲染。受这些发现的启发，我们引入了可分离因果扩散（SCD），这是一种新架构，它通过因果变换编码器将每帧一次的时间推理与通过轻量级扩散解码器的多步逐帧渲染明确解耦。对合成基准和真实基准的训练前和训练后任务进行的大量实验表明，SCD 显着提高了吞吐量和每帧延迟，同时匹配或超越了强因果扩散基准的生成质量。

Title: Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders

Authors: Amandeep Kumar, Vishal M. Patel
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2602.10099
Pdf URL: https://arxiv.org/pdf/2602.10099
Copy Paste: [[2602.10099]] Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders(https://arxiv.org/abs/2602.10099)
Keywords: generative
Abstract: Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. Our method RJF enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
摘要：利用表示编码器进行生成建模提供了一条高效、高保真合成的途径。然而，标准扩散变换器无法直接收敛于这些表示。虽然最近的工作将此归因于容量瓶颈，提出了计算昂贵的扩散变压器宽度缩放，但我们证明故障基本上是几何的。我们将几何干扰视为根本原因：标准欧几里得流匹配迫使概率路径穿过表示编码器的超球面特征空间的低密度内部，而不是遵循流形表面。为了解决这个问题，我们提出了带有雅可比正则化的黎曼流匹配（RJF）。通过将生成过程限制为流形测地线并校正曲率引起的误差传播，RJF 使标准扩散变压器架构能够在不缩放宽度的情况下收敛。我们的方法 RJF 使标准 DiT-B 架构（131M 参数）能够有效收敛，实现 FID 为 3.37，而之前的方法无法收敛。代码：这个https URL
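Note: the geometric claim in this abstract, that straight Euclidean paths between points on a hypersphere cut through its interior while geodesics stay on the surface, can be checked numerically. The sketch below compares linear interpolation with spherical (great-circle) interpolation between two unit vectors. It illustrates the geometry only and is not the RJF method; the function names are assumptions, and the antipodal case, where the geodesic is not unique, is deliberately not handled.

```python
import math


def lerp(x0, x1, t):
    """Straight-line (Euclidean) interpolation between two vectors."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def slerp(x0, x1, t):
    """Great-circle interpolation between two unit vectors (not antipodal)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)  # angle between the endpoints
    if omega < 1e-8:        # endpoints coincide: fall back to lerp
        return lerp(x0, x1, t)
    w0 = math.sin((1 - t) * omega) / math.sin(omega)
    w1 = math.sin(t * omega) / math.sin(omega)
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


x0, x1 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(norm(lerp(x0, x1, 0.5)))   # below 1: the midpoint lies inside the sphere
print(norm(slerp(x0, x1, 0.5)))  # stays on the unit sphere
```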

Title: VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.10102
Pdf URL: https://arxiv.org/pdf/2602.10102
Copy Paste: [[2602.10102]] VideoWorld 2: Learning Transferable Knowledge from Real-world Videos(https://arxiv.org/abs/2602.10102)
Keywords: generation
Abstract: Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.

Title: ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Authors: Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.10113
Pdf URL: https://arxiv.org/pdf/2602.10113
Copy Paste: [[2602.10113]] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation(https://arxiv.org/abs/2602.10113)
Keywords: generation
Abstract: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at this https URL.

Title: SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Authors: Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, Fangyin Wei
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2602.10116
Pdf URL: https://arxiv.org/pdf/2602.10116
Copy Paste: [[2602.10116]] SAGE: Scalable Agentic 3D Scene Generation for Embodied AI(https://arxiv.org/abs/2602.10116)
Keywords: generation
Abstract: Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: this https URL.