2026-03-03

Title: StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser

Authors: Jintao Zhang, Zirui Liu, Mingyue Cheng, Xianquan Wang, Zhiding Liu, Qi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00037
Pdf URL: https://arxiv.org/pdf/2603.00037
Copy Paste: [[2603.00037]] StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser(https://arxiv.org/abs/2603.00037)
Keywords: restoration
Abstract: Diffusion models have been used for probabilistic time series forecasting and show strong potential. However, fixed noise schedules often produce intermediate states that are hard to invert and a terminal state that deviates from the near-noise assumption. Meanwhile, prior methods rely on time-domain conditioning and seldom model schedule-induced spectral degradation, which limits structure recovery across noise levels. We propose StaTS, a diffusion model for probabilistic time series forecasting that learns the noise schedule and the denoiser through alternating updates. StaTS includes a Spectral Trajectory Scheduler (STS) that learns a data-adaptive noise schedule with spectral regularization to improve structural preservation and stepwise invertibility, and a Frequency Guided Denoiser (FGD) that estimates schedule-induced spectral distortion and uses it to modulate denoising strength for heterogeneous restoration across diffusion steps and variables. A two-stage training procedure stabilizes the coupling between schedule learning and denoiser optimization. Experiments on multiple real-world benchmarks show consistent gains while maintaining strong performance with fewer sampling steps. Our code is available at this https URL.
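
The terminal-state issue raised in the StaTS abstract can be made concrete with a small calculation. The sketch below assumes the widely used DDPM-style linear beta schedule (beta from 1e-4 to 0.02); this is an illustrative choice, not necessarily a schedule evaluated in the paper, and `linear_betas` and `terminal_alpha_bar` are hypothetical helper names. It computes the cumulative product of (1 - beta_t), i.e. the fraction of signal variance still present at the last diffusion step.

```python
import math

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fixed linear noise schedule (assumed here for illustration only)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def terminal_alpha_bar(betas):
    """Product of (1 - beta_t): signal variance kept at the final step."""
    # Summing logs avoids underflow for long schedules.
    return math.exp(sum(math.log1p(-b) for b in betas))

if __name__ == "__main__":
    for num_steps in (1000, 50):
        a_bar = terminal_alpha_bar(linear_betas(num_steps))
        print(f"T={num_steps}: alpha_bar_T={a_bar:.4g}, "
              f"signal scale={math.sqrt(a_bar):.3f}")
```

Under these assumed values, 1000 steps leave almost no signal, whereas reusing the same beta range with only 50 steps leaves more than half of the signal variance, so the terminal state is far from pure noise. This is the kind of mismatch a learned, data-adaptive schedule is meant to avoid.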

Title: Breaking the Factorization Barrier in Diffusion Language Models

Authors: Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00045
Pdf URL: https://arxiv.org/pdf/2603.00045
Copy Paste: [[2603.00045]] Breaking the Factorization Barrier in Diffusion Language Models(https://arxiv.org/abs/2603.00045)
Keywords: generation
Abstract: Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a trade-off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies. Code available at: this https URL
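
The "factorization barrier" in the CoDD abstract can be illustrated with a toy example. The two-position joint distribution below is invented for illustration, and the helper names are hypothetical; the code shows only the problem (independent sampling of simultaneously predicted tokens), not the CoDD inference layer itself.

```python
from collections import Counter

# Toy joint distribution over two masked positions (assumed for illustration):
# the only coherent completions are "New York" and "Los Angeles".
JOINT = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

def marginals(joint):
    """Per-position marginals: all a fully factorized output can represent."""
    first, second = Counter(), Counter()
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return dict(first), dict(second)

def factorized_prob(joint, pair):
    """Probability of a pair when both positions are sampled independently."""
    first, second = marginals(joint)
    return first.get(pair[0], 0.0) * second.get(pair[1], 0.0)

def incoherent_mass(joint):
    """Probability the factorized sampler assigns to pairs the joint never emits."""
    first, second = marginals(joint)
    return sum(pa * pb
               for a, pa in first.items()
               for b, pb in second.items()
               if (a, b) not in joint)

if __name__ == "__main__":
    print(factorized_prob(JOINT, ("New", "Angeles")))
    print(incoherent_mass(JOINT))
```

In this toy case the marginals are exact, yet half of the probability mass of the factorized sampler falls on pairs such as "New Angeles" that the true joint never produces. Resolving the positions one at a time removes the incoherence but gives up the parallel speedup, which is the trade-off the abstract describes.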

Title: BiJEPA: Bi-directional Joint Embedding Predictive Architecture for Symmetric Representation Learning

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00049
Pdf URL: https://arxiv.org/pdf/2603.00049
Copy Paste: [[2603.00049]] BiJEPA: Bi-directional Joint Embedding Predictive Architecture for Symmetric Representation Learning(https://arxiv.org/abs/2603.00049)
Keywords: generation
Abstract: Self-Supervised Learning (SSL) has shifted from pixel-level reconstruction to latent space prediction, spearheaded by the Joint Embedding Predictive Architecture (JEPA). While effective, standard JEPA models typically rely on a uni-directional prediction mechanism (e.g. Context → Target), potentially neglecting the informative signal inherent in the inverse relationship and thereby degrading performance. In this work, we propose BiJEPA, a Bi-Directional Joint Embedding Predictive Architecture that enforces cycle-consistent predictability between data segments. We address the inherent instability of symmetric prediction (representation explosion) by introducing a critical norm regularization mechanism on the representation vectors. We evaluate BiJEPA on three distinct modalities: synthetic periodic signals, chaotic Lorenz attractor trajectories, and high-dimensional image data (MNIST). Our results demonstrate that BiJEPA achieves stable convergence without collapse, captures the semantic structure of chaotic systems, and learns robust temporal and spatial representations capable of generation and generalisation, offering a more holistic approach to representation learning.

Title: Knowledge-guided generative surrogate modeling for high-dimensional design optimization under scarce data

Authors: Bingran Wang, Seongha Jeong, Sebastiaan P. C. van Schie, Dongyeon Han, Jaeho Min, John T. Hwang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00052
Pdf URL: https://arxiv.org/pdf/2603.00052
Copy Paste: [[2603.00052]] Knowledge-guided generative surrogate modeling for high-dimensional design optimization under scarce data(https://arxiv.org/abs/2603.00052)
Keywords: generative
Abstract: Surrogate models are widely used in mechanical design and manufacturing process optimization, where high-fidelity computational models may be unavailable or prohibitively expensive. Their effectiveness, however, is often limited by data scarcity, as purely data-driven surrogates struggle to achieve high predictive accuracy in such situations. Subject matter experts (SMEs) frequently possess valuable domain knowledge about functional relationships, yet few surrogate modeling techniques can systematically integrate this information with limited data. We address this challenge with RBF-Gen, a knowledge-guided surrogate modeling framework that combines scarce data with domain knowledge. This method constructs a radial basis function (RBF) space with more centers than training samples and leverages the null space via a generator network, inspired by the principle of maximum information preservation. The introduced latent variables provide a principled mechanism to encode structural relationships and distributional priors during training, thereby guiding the surrogate toward physically meaningful solutions. Numerical studies demonstrate that RBF-Gen significantly outperforms standard RBF surrogates on 1D and 2D structural optimization problems in data-scarce settings, and achieves superior predictive accuracy on a real-world semiconductor manufacturing dataset. These results highlight the potential of combining limited experimental data with domain expertise to enable accurate and practical surrogate modeling in mechanical and process design problems.
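
The idea in the RBF-Gen abstract of using more RBF centers than training samples can be sketched in the smallest possible case: one training point and two centers. The interpolation constraint is then a single linear equation in two weights, leaving a one-dimensional null space. In the sketch below a scalar `latent` moves along that null space directly; the paper describes a generator network for this role, so the direct parameterization, the Gaussian kernel, and all function names are assumptions made for illustration.

```python
import math

def gaussian_rbf(x, center, width=1.0):
    """Gaussian radial basis function (kernel choice is an assumption)."""
    return math.exp(-((x - center) / width) ** 2)

def weight_family(x_train, y_train, centers, latent):
    """Weights that interpolate one training point, indexed by a scalar latent.

    With two centers and one sample, the constraint phi . w = y has a
    one-dimensional null space; `latent` selects a point along it.
    """
    phi = [gaussian_rbf(x_train, c) for c in centers]
    norm_sq = sum(p * p for p in phi)
    particular = [y_train * p / norm_sq for p in phi]  # minimum-norm solution
    null_dir = [-phi[1], phi[0]]                       # phi . null_dir == 0
    return [w + latent * n for w, n in zip(particular, null_dir)]

def surrogate(x, centers, weights):
    """Evaluate the RBF surrogate at x."""
    return sum(w * gaussian_rbf(x, c) for w, c in zip(weights, centers))

if __name__ == "__main__":
    centers = [0.0, 2.0]
    for latent in (-1.0, 0.0, 3.0):
        w = weight_family(1.0, 2.0, centers, latent)
        print(latent, surrogate(1.0, centers, w), surrogate(0.0, centers, w))
```

Every latent value reproduces the training point, while predictions away from it differ. That leftover freedom is where prior knowledge about shape or distribution could be encoded.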

Title: VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation

Authors: Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00116
Pdf URL: https://arxiv.org/pdf/2603.00116
Copy Paste: [[2603.00116]] VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation(https://arxiv.org/abs/2603.00116)
Keywords: generative
Abstract: Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part's presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.

Title: Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks

Authors: Sushi Rao, Jingwei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00118
Pdf URL: https://arxiv.org/pdf/2603.00118
Copy Paste: [[2603.00118]] Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks(https://arxiv.org/abs/2603.00118)
Keywords: super-resolution
Abstract: This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network's capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across ×2, ×3, and ×4 scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods. Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.

Title: NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence

Authors: Aman Ulla
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.00122
Pdf URL: https://arxiv.org/pdf/2603.00122
Copy Paste: [[2603.00122]] NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence(https://arxiv.org/abs/2603.00122)
Keywords: generation, generative
Abstract: Document extraction is a prerequisite for retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI: it turns unstructured documents such as PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models (element detection and layout detection) with rule-based grouping and optional vision-language enhancement. Each incoming page image is passed through both models at the same time. The element model finds semantic content such as titles, headers, text, tables, and images, and the layout model finds structural regions such as layout_box, column_group, multi_column, and row_group. A key design decision is to first send each image or figure through an image classifier (ViT) that decides whether it is relevant. Only relevant images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and cost. NovaLAD is built for speed: it runs on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several output formats, including structured JSON, Markdown, RAG-ready text, and knowledge graphs. On the DP-Bench benchmark (upstage/dp-bench) it reaches 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains the extraction method, the architecture, the data flow, and how NovaLAD is made both accurate and usable without a GPU.
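
The control flow described in the NovaLAD abstract (two detectors run concurrently, then a relevance gate in front of the vision-LLM step) can be sketched as follows. Every function here is a hypothetical stub returning canned data; none of these names come from NovaLAD, and the real system's models, outputs, and threading strategy may differ.

```python
from concurrent.futures import ThreadPoolExecutor

# All functions below are hypothetical stand-ins, not NovaLAD's real API.
def detect_elements(page):
    """Stub for the element model (title, text, table, image, ...)."""
    return [{"type": "title", "text": "Q3 report"},
            {"type": "image", "name": "logo.png"},
            {"type": "image", "name": "revenue_chart.png"}]

def detect_layout(page):
    """Stub for the layout model (layout_box, column_group, ...)."""
    return [{"type": "multi_column", "columns": 2}]

def is_relevant(image):
    """Stub for the image classifier that gates the costly vision-LLM call."""
    return "logo" not in image["name"]

def describe_image(image):
    """Stub for the vision-LLM enrichment step."""
    return {"name": image["name"], "summary": "summary of " + image["name"]}

def parse_page(page):
    # Run both detectors at the same time, as the abstract describes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        elements_future = pool.submit(detect_elements, page)
        layout_future = pool.submit(detect_layout, page)
        elements = elements_future.result()
        layout = layout_future.result()
    # Only images that pass the relevance gate reach the enrichment step.
    images = [e for e in elements if e["type"] == "image"]
    described = [describe_image(img) for img in images if is_relevant(img)]
    return {"elements": elements, "layout": layout, "images": described}

if __name__ == "__main__":
    print(parse_page("page-1")["images"])
```

Putting the cheap classifier in front of the expensive model means the enrichment cost scales with the number of relevant images, not with every image detected on the page.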

Title: CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

Authors: Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00123
Pdf URL: https://arxiv.org/pdf/2603.00123
Copy Paste: [[2603.00123]] CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers(https://arxiv.org/abs/2603.00123)
Keywords: generation
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.

Title: You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

Authors: Kairan Zhao, Eleni Triantafillou, Peter Triantafillou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00133
Pdf URL: https://arxiv.org/pdf/2603.00133
Copy Paste: [[2603.00133]] You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models(https://arxiv.org/abs/2603.00133)
Keywords: generation, generative
Abstract: Generative models have been shown to "memorize" certain training data, leading them to generate verbatim or near-verbatim copies of training images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.

Title: Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

Authors: Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00140
Pdf URL: https://arxiv.org/pdf/2603.00140
Copy Paste: [[2603.00140]] Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion(https://arxiv.org/abs/2603.00140)
Keywords: generation
Abstract: Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: this https URL.

Title: From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Authors: Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2603.00141
Pdf URL: https://arxiv.org/pdf/2603.00141
Copy Paste: [[2603.00141]] From Scale to Speed: Adaptive Test-Time Scaling for Image Editing(https://arxiv.org/abs/2603.00141)
Keywords: generation
Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address these challenges, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
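
Two of the ADE-CoT strategies, difficulty-aware budgets and opportunistic stopping, can be sketched as a simple loop. This is a simplified sequential illustration under stated assumptions: the linear difficulty-to-budget rule, the acceptance threshold, and the function names are all hypothetical, and the edit-specific early pruning and depth-first search described in the abstract are omitted.

```python
def allocate_budget(difficulty, min_samples=2, max_samples=16):
    """Map an estimated difficulty in [0, 1] to a sampling budget.

    The linear rule and the bounds are assumptions for illustration.
    """
    difficulty = min(max(difficulty, 0.0), 1.0)
    return min_samples + round(difficulty * (max_samples - min_samples))

def edit_with_early_stop(candidates, verifier, difficulty, accept_threshold=0.8):
    """Score candidates one at a time and stop once one passes the verifier.

    `candidates` is any iterable of edited results; `verifier` returns a
    score in [0, 1]. Returns (best_candidate, best_score, samples_used).
    """
    budget = allocate_budget(difficulty)
    best, best_score, used = None, float("-inf"), 0
    for candidate in candidates:
        if used >= budget:
            break
        used += 1
        score = verifier(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= accept_threshold:
            break  # opportunistic stop: an acceptable result was found
    return best, best_score, used

if __name__ == "__main__":
    scores = {"edit_a": 0.3, "edit_b": 0.9, "edit_c": 0.95}
    print(edit_with_early_stop(list(scores), scores.get, difficulty=0.5))
```

In contrast to Best-of-N, which always scores all N samples, the loop stops at the first candidate that clears the threshold, so easy edits consume only a small part of their budget.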

Title: Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

Authors: Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, Ajmal Mian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00144
Pdf URL: https://arxiv.org/pdf/2603.00144
Copy Paste: [[2603.00144]] Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation(https://arxiv.org/abs/2603.00144)
Keywords: generation
Abstract: Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.

Title: Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction

Authors: Zhihao Li, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00149
Pdf URL: https://arxiv.org/pdf/2603.00149
Copy Paste: [[2603.00149]] Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction(https://arxiv.org/abs/2603.00149)
Keywords: super-resolution
Abstract: Existing image super-resolution (SR) and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid SR with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code is available at this https URL.

Title: EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection

Authors: Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia, Junliang Liu, Man Ho Lam, Wenxuan Wang, Michael R. Lyu
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.00155
Pdf URL: https://arxiv.org/pdf/2603.00155
Copy Paste: [[2603.00155]] EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection(https://arxiv.org/abs/2603.00155)
Keywords: generation
Abstract: Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing approaches based on Multimodal Large Language Models (MLLMs), however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at this https URL.
摘要：自动生成学术海报旨在将冗长的研究论文提炼成简洁、视觉连贯的演示文稿。然而，现有的基于多模态大型语言模型 (MLLM) 的方法存在三个关键限制：整篇论文输入的信息密度低、令牌消耗过多以及布局验证不可靠。我们提出了 EfficientPosterGen，这是一个端到端框架，它通过语义感知检索和令牌高效的多模式生成来解决这些挑战。 EfficientPosterGen引入了三个核心创新：（1）语义感知关键信息检索（SKIR），构建语义贡献图来建模段间关系并选择性地保留重要内容； (2) 基于视觉的上下文压缩（VCC），将选定的文本片段渲染为图像，将文本信息转换为视觉模态，显着减少标记使用，同时生成海报就绪的要点； (3) 无代理布局违规检测 (ALVD)，这是一种基于确定性颜色梯度的算法，无需辅助 MLLM 即可可靠地检测内容溢出和空间稀疏性。大量实验表明，EfficientPosterGen 在保持高海报质量的同时，在令牌效率和布局可靠性方面实现了显着提高，为自动化学术海报生成提供了可扩展的解决方案。我们的代码可以在这个 https URL 上找到。

Title: FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

Authors: Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu
Subjects: cs.CV, cs.AI, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.00159
Pdf URL: https://arxiv.org/pdf/2603.00159
Copy Paste: [[2603.00159]] FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation(https://arxiv.org/abs/2603.00159)
Keywords: generation
Abstract: Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
摘要：由于唇形同步不完美、动作不自然以及与人类感知相关性较差等评估指标等持续存在的问题，生成逼真的头部说话视频仍然具有挑战性。我们提出了 FlowPortrait，这是一种音频驱动肖像动画的强化学习框架，构建在自回归音频到视频生成的多模式主干上。 FlowPortrait 引入了基于多模态大语言模型 (MLLM) 的人性化评估系统，用于评估口型同步准确性、表现力和运动质量。这些信号与感知和时间一致性正则化器相结合，形成稳定的复合奖励，用于通过组相对策略优化（GRPO）对生成器进行后训练。包括自动评估和人类偏好研究在内的大量实验表明，FlowPortrait 始终能够生成更高质量的头像视频，凸显了强化学习对于肖像动画的有效性。
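Note on GRPO: the abstract names Group Relative Policy Optimization but does not spell out the reward normalization. The sketch below shows only the group-relative advantage step as it is commonly described, assuming each audio clip yields a group of generated videos that each receive one scalar composite reward. The function name, the `eps` constant, and the example rewards are illustrative assumptions, not details from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical composite rewards for four videos sampled from one audio clip.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Samples scoring above their group's mean get a positive advantage and those below get a negative one, so no separate learned value baseline is needed for this step.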

Title: SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision

Authors: S. Kalaycioglu, C. Hong, M. Zhu, H. Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00161
Pdf URL: https://arxiv.org/pdf/2603.00161
Copy Paste: [[2603.00161]] SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision(https://arxiv.org/abs/2603.00161)
Keywords: generation
Abstract: Early ophthalmic screening in low-resource and remote settings is constrained by access to specialized equipment and trained practitioners. We present SKINOPATHY AI, a smartphone-first web application that delivers five complementary, explainable screening modules entirely through commodity mobile hardware: (1) redness quantification via LAB a* color-space normalization; (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; (3) pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; (4) scleral color indexing for icterus and anemia proxies via LAB/HSV statistics; and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system is implemented as a React/FastAPI stack with OpenCV and MediaPipe, MongoDB-backed session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage. We detail system architecture, algorithm design, evaluation methodology, clinical context, and ethical boundaries of the platform. SKINOPATHY AI demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing a foundation for future clinically validated mobile ophthalmoscopy tools.
摘要：在资源匮乏和偏远地区进行早期眼科筛查受到专业设备和训练有素的从业人员的限制。我们推出 SKINOPATHY AI，这是一款智能手机优先的网络应用程序，完全通过商用移动硬件提供五个互补的、可解释的筛选模块：(1) 通过 LAB a* 色彩空间标准化进行红度量化； (2) 使用 MediaPipe FaceMesh Eye Aspect Ratio (EAR) 和自适应阈值进行眨眼率估计； (3) 通过瞳孔与虹膜比 (PIR) 时间序列分析来表征瞳孔光反射特征； (4) 通过 LAB/HSV 统计显示黄疸和贫血指标的巩膜颜色指数； (5) 虹膜地标校准病变侵占测量，具有毫米级估计和纵向趋势跟踪。该系统以 React/FastAPI 堆栈的形式实现，具有 OpenCV 和 MediaPipe、MongoDB 支持的会话持久性以及 PDF 报告生成功能。所有算法都是完全确定性的、保护隐私的，并且是为非诊断性消费者分类而设计的。我们详细介绍了平台的系统架构、算法设计、评估方法、临床背景和道德边界。 SKINOPATHY AI 证明，多信号眼科筛查在未经修改的智能手机上是可行的，无需基于云的人工智能推理，为未来经过临床验证的移动检眼镜工具奠定了基础。
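Note on the Eye Aspect Ratio: the abstract mentions EAR-based blink-rate estimation without giving the formula. The sketch below uses the widely cited six-landmark definition of EAR and a simple below-threshold transition counter. The landmark ordering, the fixed `threshold` argument (the paper describes adaptive thresholding, which is not reproduced here), and the function names are assumptions made for illustration.

```python
import math

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|).

    p1 and p4 are the horizontal eye corners; (p2, p6) and (p3, p5)
    are vertically opposed eyelid landmark pairs.
    """
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)

def count_blinks(ear_series, threshold):
    """Count open-to-closed transitions where EAR drops below the threshold."""
    blinks, closed = 0, False
    for ear in ear_series:
        if ear < threshold and not closed:
            blinks += 1
        closed = ear < threshold
    return blinks
```

Dividing the blink count by the recording duration gives a blink rate. EAR falls toward zero as the eyelids close because the vertical distances shrink while the corner-to-corner distance stays roughly constant.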

Title: Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

Authors: Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00166
Pdf URL: https://arxiv.org/pdf/2603.00166
Copy Paste: [[2603.00166]] Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?(https://arxiv.org/abs/2603.00166)
Keywords: generation, generative
Abstract: Recent advances in generative AI have demonstrated remarkable ability to produce high-quality content. However, these models often exhibit a "Paradox of Simplicity": while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and further exploratory insights. By establishing this framework, we aim to draw more attention to AI Obedience and encourage deeper exploration to bridge this gap.
摘要：生成式人工智能的最新进展已经证明了生成高质量内容的卓越能力。然而，这些模型经常表现出“简单性悖论”：虽然它们可以渲染复杂的景观，但它们常常无法完成简单的确定性任务。为了解决这个问题，我们将服从形式化为与指令对齐的能力，并建立从基本语义对齐到像素级系统精度的分层分级系统，这为合并和分类现有文献提供了统一的范例。然后，我们进行案例研究来识别常见的服从差距，揭示生成先验如何经常超越逻辑约束。为了评估高级服从性，我们推出了 VIOLIN（视觉服从性 4 级评估），这是第一个专注于跨六个变体的纯色生成的基准。 SOTA 模型的大量实验揭示了基本的服从限制和进一步的探索性见解。通过建立这个框架，我们的目标是引起人们对人工智能服从的更多关注，并鼓励更深入的探索来弥合这一差距。

Title: NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces

Authors: Jiwoo Kim, Swarajh Mehta, Hao-Lun Hsu, Hyunwoo Ryu, Yudong Liu, Miroslav Pajic
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00180
Pdf URL: https://arxiv.org/pdf/2603.00180
Copy Paste: [[2603.00180]] NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces(https://arxiv.org/abs/2603.00180)
Keywords: generation, generative
Abstract: Generative modeling of neural network parameters is often tied to architectures because standard parameter representations rely on known weight-matrix dimensions. Generation is further complicated by permutation symmetries that allow networks to model similar input-output functions while having widely different, unaligned parameterizations. In this work, we introduce Neural Network Diffusion Transformers (NNiTs), which generate weights in a width-agnostic manner by tokenizing weight matrices into patches and modeling them as locally structured fields. We establish that Graph HyperNetworks (GHNs) with a convolutional neural network (CNN) decoder structurally align the weight space, creating the local correlation necessary for patch-based processing. Focusing on MLPs, where permutation symmetry is especially apparent, NNiT generates fully functional networks across a range of architectures. Our approach jointly models discrete architecture tokens and continuous weight patches within a single sequence model. On ManiSkill3 robotics tasks, NNiT achieves >85% success on architecture topologies unseen during training, while baseline approaches fail to generalize.
摘要：神经网络参数的生成建模通常与架构相关，因为标准参数表示依赖于已知的权重矩阵维度。排列对称性使生成变得更加复杂，排列对称性允许网络对相似的输入输出函数进行建模，同时具有广泛不同的、未对齐的参数化。在这项工作中，我们引入了神经网络扩散变压器（NNiT），它通过将权重矩阵标记为补丁并将其建模为局部结构化字段，以与宽度无关的方式生成权重。我们建立了带有卷积神经网络（CNN）解码器的图超网络（GHN），在结构上对齐权重空间，创建基于补丁的处理所需的局部相关性。 NNiT 专注于排列对称性特别明显的 MLP，跨一系列架构生成功能齐全的网络。我们的方法在单个序列模型中联合建模离散架构令牌和连续权重补丁。在 ManiSkill3 机器人任务中，NNiT 在训练期间未见的架构拓扑上取得了超过 85% 的成功，而基线方法则无法泛化。
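Note on patch tokenization: the abstract says weight matrices are tokenized into patches so that generation does not depend on layer width. The sketch below illustrates that idea only in its simplest form, assuming square, non-overlapping patches and zero padding at the edges. Both choices, and the `patchify` name, are assumptions for illustration and may differ from the paper's scheme.

```python
def patchify(weights, patch):
    """Split a 2D weight matrix (list of rows) into flattened patch x patch tokens.

    Edge patches are zero-padded so every token has patch * patch entries.
    """
    rows, cols = len(weights), len(weights[0])
    tokens = []
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            token = [
                weights[r][c] if r < rows and c < cols else 0.0
                for r in range(r0, r0 + patch)
                for c in range(c0, c0 + patch)
            ]
            tokens.append(token)
    return tokens

# A 2x3 layer and a 4x2 layer yield different token counts but the same token size.
narrow = patchify([[1, 2, 3], [4, 5, 6]], patch=2)
wide = patchify([[1, 2], [3, 4], [5, 6], [7, 8]], patch=2)
```

Because every token has the same length regardless of the matrix shape, a sequence model can consume layers of different widths as sequences of different lengths.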

Title: Engineering FAIR Privacy-preserving Applications that Learn Histories of Disease

Authors: Ines N. Duarte, Praphulla M. S. Bhawsar, Lee K. Mason, Jeya Balaji Balasubramanian, Daniel E. Russ, Arlindo L. Oliveira, Jonas S. Almeida
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.00181
Pdf URL: https://arxiv.org/pdf/2603.00181
Copy Paste: [[2603.00181]] Engineering FAIR Privacy-preserving Applications that Learn Histories of Disease(https://arxiv.org/abs/2603.00181)
Keywords: generation, generative
Abstract: A recent report on "Learning the natural history of human disease with generative transformers" created an opportunity to assess the engineering challenge of delivering user-facing Generative AI applications in privacy-sensitive domains. The application of these models, particularly for personalized healthcare tasks like predicting individual morbidity risk, is typically constrained by data privacy concerns. This project was accordingly designed as an in-browser model deployment exercise (an "App") testing the architectural boundaries of client-side inference generation (no downloads or installations). We relied exclusively on the documentation provided in the reference report to develop the model, specifically testing the "R" component of the FAIR data principles: Findability, Accessibility, Interoperability, and Reusability. The successful model deployment, leveraging ONNX and a custom JavaScript SDK, establishes a secure, high-performance architectural blueprint for the future of private generative AI in medicine.
摘要：最近一份关于“通过生成变压器学习人类疾病的自然史”的报告创造了一个机会来评估在隐私敏感领域提供面向用户的生成人工智能应用程序的工程挑战。这些模型的应用，特别是对于预测个人发病风险等个性化医疗任务，通常受到数据隐私问题的限制。因此，该项目被设计为浏览器内模型部署练习（“应用程序”），测试客户端推理生成的架构边界（无需下载或安装）。我们完全依赖参考报告中提供的文档来开发模型，特别是测试 FAIR 数据原则的“R”组件：可查找性、可访问性、互操作性和可重用性。成功的模型部署利用 ONNX 和自定义 JavaScript SDK，为医学领域私人生成人工智能的未来建立了安全、高性能的架构蓝图。

Title: SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models

Authors: Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.00194
Pdf URL: https://arxiv.org/pdf/2603.00194
Copy Paste: [[2603.00194]] SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models(https://arxiv.org/abs/2603.00194)
Keywords: generation, generative
Abstract: The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
摘要：文本到视频生成模型的兴起引发了人们对内容真实性、版权保护和恶意滥用的日益担忧。水印是监管此类人工智能生成内容的有效机制，其中高保真度和强鲁棒性尤为关键。最近的生成图像水印方法通过利用水印信息和伪随机密钥来控制初始采样噪声，从而实现无损嵌入，提供了有前景的基础。然而，直接将这些技术扩展到视频会带来两个关键限制：现有设计隐式依赖于视频帧和用于水印加密的帧相关伪随机二进制序列之间的严格对齐。一旦这种对齐方式被破坏，后续的水印提取就会变得不可靠；视频特定的失真（例如帧间压缩）会显着降低水印的可靠性。为了解决这些问题，我们提出了 SKeDA，这是一种专为文本到视频扩散模型量身定制的生成水印框架。 SKeDA由两个部分组成：（1）基于随机密钥的分布保持采样（SKe）采用单基伪随机二进制序列进行水印加密，并通过排列导出帧级加密序列。该设计将水印提取从同步敏感序列解码转变为排列容忍的集合级聚合，大大提高了针对帧重排序和丢失的鲁棒性；（2）差分注意力（DA），它计算帧间差异并在提取过程中动态调整注意力权重，增强针对时间扭曲的鲁棒性。大量实验表明 SKeDA 保持了较高的视频生成质量和水印鲁棒性。

Title: TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models

Authors: Daniel Nobrega Medeiros
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00206
Pdf URL: https://arxiv.org/pdf/2603.00206
Copy Paste: [[2603.00206]] TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models(https://arxiv.org/abs/2603.00206)
Keywords: generation, generative
Abstract: Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: https://doi.org/10.57967/hf/7904).
摘要：现有的视觉推理基准主要依赖于自然语言提示，评估狭隘的推理模式，或依赖于主观评分程序，例如法学硕士作为法官。我们引入了 TACIT Benchmark，这是一个程序化视觉推理基准，包含 6 个推理领域的 10 项任务：空间导航、抽象模式完成、因果模拟、逻辑约束满足、图论和拓扑。该基准提供双轨评估：一个生成轨道，其中模型必须生成通过确定性计算机视觉管道验证的解决方案图像，以及一个判别轨道，提供五向多项选择，并具有结构合理的近乎失误干扰因素。每个干扰因素都违反了一个结构约束，要求模型推理细粒度的视觉差异，而不是利用表面的线索。 0.1.0 版本分发了 6,000 个谜题（三种分辨率的 108,000 张 PNG 图像），具有完全确定性的种子生成和可重复的验证。数据集、生成代码和评估工具根据 Apache 2.0 许可证在 HuggingFace 上发布（DOI：https://doi.org/10.57967/hf/7904）。

Title: VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Authors: Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu, Hongjing Zhang, Jakub Zablocki, Yifan Xing, Qin Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00207
Pdf URL: https://arxiv.org/pdf/2603.00207
Copy Paste: [[2603.00207]] VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models(https://arxiv.org/abs/2603.00207)
Keywords: generation
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
摘要：大型推理模型的进步通过扩展推理扩展测试时计算，在复杂推理任务上表现出了强大的性能。然而，最近的研究发现，在依赖于视觉的任务中，推理时的扩展文本推理可能会降低性能，因为模型逐渐失去对视觉标记的关注，并越来越依赖于文本先验。为了解决这个问题，之前的工作使用基于强化学习（RL）的微调来路由视觉标记或在推理过程中采用重新聚焦机制。虽然有效，但这些方法的计算成本很高，需要大规模数据生成和策略优化。为了利用测试时计算的优势而无需额外的 RL 微调，我们提出了 VisRef，一种基于视觉的测试时缩放框架。我们的关键思想是通过重新注入与推理上下文语义相关的视觉标记核心集来积极引导推理过程，同时保持图像的多样性和全局代表性，从而实现更扎实的多模态推理。使用最先进的多模态大型推理模型对三个视觉推理基准进行的实验表明，在固定的测试时间计算预算下，VisRef 始终优于现有的测试时间扩展方法高达 6.4%。

Title: Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection

Authors: Brianna D'Urso, Tahmid Hasan Sakib, Syed Rafay Hasan, Terry N. Guo
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.00217
Pdf URL: https://arxiv.org/pdf/2603.00217
Copy Paste: [[2603.00217]] Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection(https://arxiv.org/abs/2603.00217)
Keywords: generative
Abstract: This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB (a customized dataset for the AV environment), by pasting traffic sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and to generate patches using a Generative Adversarial Network (GAN) with latent space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed using the QCar's front CSI camera. The configurations vary distance, patch size, and patch placement. Across configurations, NAPs reduce the detector's STOP class confidence. These results, along with a detailed step-by-step methodology, indicate the utility of the CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The findings further motivate research into defenses that address localized patch corruption in embedded perception pipelines.
摘要：本文研究了当检测器在自动驾驶车辆 (AV) 环境的定制数据集上进行训练时，自然对抗补丁 (NAP) 转移到物理交通标志设置的效果如何。我们通过将德国交通标志识别基准 (GTSRB) 中的交通标志实例粘贴到从目标平台捕获的未失真背景上，构建了一个复合数据集 CompGTSRB（这是针对 AV 环境的定制数据集）。 CompGTSRB 用于训练 YOLOv5 模型，并使用具有潜在空间优化的生成对抗网络 (GAN) 生成补丁，遵循现有的 NAP 方法。我们利用 QCar 中提供的前置 CSI 摄像头在 Quanser QCar 测试台上进行了一系列实验。在所有配置中，NAP 都会降低探测器的 STOP 类别置信度。不同的配置包括距离、贴片尺寸和贴片放置。这些结果以及详细的分步方法表明了 CompGTSRB 数据集的实用性以及所提出的用于可信补丁评估的系统物理协议。该研究进一步激发了研究解决嵌入式感知管道中局部补丁损坏的防御措施。

Title: Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization

Authors: He Li, Wenyue He, Weihang Kong, Xingchen Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00266
Pdf URL: https://arxiv.org/pdf/2603.00266
Copy Paste: [[2603.00266]] Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization(https://arxiv.org/abs/2603.00266)
Keywords: generation
Abstract: Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for cross-spectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
摘要：用于密集预测的多模式对抗攻击在很大程度上仍未得到充分探索。特别是，由于异构光谱特性和特定模态的强度分布，视觉-红外 (VI) 感知系统带来了独特的挑战。现有的对抗性补丁方法主要针对单模态输入而设计，无法解决跨谱不一致性，导致应用于 VI 密集预测模型时攻击有效性降低且隐秘性较差。为了应对这些挑战，我们提出了一种联合位置颜色优化框架（AP-PCO），用于在视觉-红外设置中生成对抗性补丁。所提出的方法使用从模型输出导出的适应度函数同时优化补丁放置和颜色组合，使单个补丁能够干扰可见光和红外模态。为了进一步弥合光谱差异，我们引入了一种跨模态颜色适应策略，该策略根据红外灰度特性限制斑块外观，同时保持可见域中的强烈扰动，从而减少跨光谱显着性。优化过程无需内部模型信息即可运行，支持灵活的黑盒攻击。对视觉-红外密集预测任务的大量实验表明，所提出的 AP-PCO 在多种架构中实现了一致的强大攻击性能，为 VI 感知系统的鲁棒性评估提供了实用的基准。

Title: Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance

Authors: Galen Pogoncheff, Alvin Wang, Jacob Granley, Michael Beyeler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00362
Pdf URL: https://arxiv.org/pdf/2603.00362
Copy Paste: [[2603.00362]] Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance(https://arxiv.org/abs/2603.00362)
Keywords: generation
Abstract: Cortical visual prostheses aim to restore sight by electrically stimulating neurons in early visual cortex (V1). With the emergence of high-density and flexible neural interfaces, electrode placement within three-dimensional cortex has become a critical surgical planning problem. Existing strategies emphasize visual field coverage and anatomical heuristics but do not directly optimize predicted perceptual outcomes under safety constraints. We present a percept-aware framework for surgical planning of cortical visual prostheses that formulates electrode placement as a constrained optimization problem in anatomical space. Electrode coordinates are treated as learnable parameters and optimized end-to-end using a differentiable forward model of prosthetic vision. The objective minimizes task-level perceptual error while incorporating vascular avoidance and gray matter feasibility constraints. Evaluated on simulated reading and natural image tasks using realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based placement strategies. Importantly, vascular safety constraints eliminate margin violations while preserving perceptual performance. The framework further enables co-optimization of multi-electrode thread configurations under fixed insertion budgets. These results demonstrate how differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces and provide a foundation for optimizing next-generation visual prostheses.
摘要：皮层视觉假体旨在通过电刺激早期视觉皮层（V1）的神经元来恢复视力。随着高密度和灵活的神经接口的出现，三维皮层内的电极放置已成为一个关键的手术计划问题。现有策略强调视野覆盖和解剖启发式，但没有直接优化安全约束下的预测感知结果。我们提出了一种用于皮层视觉假体手术规划的感知感知框架，该框架将电极放置制定为解剖空间中的约束优化问题。电极坐标被视为可学习参数，并使用假肢视觉的可微正向模型进行端到端优化。该目标最大限度地减少任务级感知误差，同时结合血管回避和灰质可行性约束。使用真实的折叠皮质几何结构（FreeSurfer fsaverage）对模拟阅读和自然图像任务进行评估，感知感知优化相对于基于覆盖的放置策略持续提高了重建保真度。重要的是，血管安全约束消除了边界违规，同时保留了感知性能。该框架进一步实现了固定插入预算下多电极线程配置的共同优化。这些结果证明了可微分感知模型如何为皮层神经接口提供基于解剖学、安全意识的计算机辅助规划，并为优化下一代视觉假体提供基础。

Title: DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography

Authors: Yujia Wu, Shuoqi Chen, Shiru Wang, Yucheng Tang, Petr Bruza, Geoffrey P. Luke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00382
Pdf URL: https://arxiv.org/pdf/2603.00382
Copy Paste: [[2603.00382]] DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography(https://arxiv.org/abs/2603.00382)
Keywords: generative
Abstract: Accurate Speed-of-Sound (SoS) reconstruction from acoustic waveforms is a cornerstone of ultrasound computed tomography (USCT), enabling quantitative velocity mapping that reveals subtle anatomical details and pathological variations often invisible in conventional imaging. However, practical utility is hindered by the limitations of existing algorithms; traditional Full Waveform Inversion (FWI) is computationally intensive, while current deep learning approaches tend to produce oversmoothed results lacking fine details. We propose DiffSOS, a conditional diffusion model that directly maps acoustic waveforms to SoS maps. Our framework employs a specialized acoustic ControlNet to strictly ground the denoising process in physical wave measurements. To ensure structural consistency, we optimize a hybrid loss function that integrates noise prediction, spatial reconstruction, and noise frequency content. To accelerate inference, we employ stochastic Denoising Diffusion Implicit Model (DDIM) sampling, achieving near real-time reconstruction with only 10 steps. Crucially, we exploit the stochastic generative nature of our framework to estimate pixel-wise uncertainty, providing a measure of reliability that is often absent in deterministic approaches. Evaluated on the OpenPros USCT benchmark, DiffSOS significantly outperforms state-of-the-art networks, achieving an average Multi-scale Structural Similarity of 0.957. Our approach provides high-fidelity SoS maps with a principled measure of confidence, facilitating safer and faster clinical interpretation.
摘要：根据声波波形进行精确的声速 (SoS) 重建是超声计算机断层扫描 (USCT) 的基石，它能够实现定量速度测绘，从而揭示传统成像中通常不可见的细微解剖细节和病理变化。然而，现有算法的局限性阻碍了实用性；传统的全波形反演（FWI）计算量大，而当前的深度学习方法往往会产生缺乏细节的过度平滑结果。我们提出了 DiffSOS，一种条件扩散模型，可直接将声学波形映射到 SoS 地图。我们的框架采用专门的声学 ControlNet 来严格保证物理波测量中的降噪过程。为了确保结构一致性，我们优化了集成噪声预测、空间重建和噪声频率内容的混合损失函数。为了加速推理，我们采用随机去噪扩散隐式模型 (DDIM) 采样，只需 10 个步骤即可实现近实时重建。至关重要的是，我们利用框架的随机生成性质来估计像素方面的不确定性，提供确定性方法中通常缺乏的可靠性度量。根据 OpenPros USCT 基准进行评估，DiffSOS 的性能显着优于最先进的网络，平均多尺度结构相似度达到 0.957。我们的方法提供具有原则性置信度的高保真 SoS 地图，促进更安全、更快速的临床解释。
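Note on pixel-wise uncertainty: the abstract says uncertainty is estimated from the stochastic nature of sampling but does not give the estimator. One common choice, assumed here, is to run the sampler several times on the same input and take the per-pixel mean as the reconstruction and the per-pixel standard deviation as the uncertainty map. The function name and the toy values are illustrative only.

```python
import statistics

def pixelwise_mean_and_std(samples):
    """samples: list of equally sized 2D maps (lists of rows), one per stochastic run."""
    rows, cols = len(samples[0]), len(samples[0][0])
    mean_map = [
        [statistics.fmean([s[r][c] for s in samples]) for c in range(cols)]
        for r in range(rows)
    ]
    std_map = [
        [statistics.pstdev([s[r][c] for s in samples]) for c in range(cols)]
        for r in range(rows)
    ]
    return mean_map, std_map

# Two hypothetical 1x2 speed-of-sound maps (m/s) from two sampling runs.
mean_map, std_map = pixelwise_mean_and_std([[[1500.0, 1540.0]], [[1500.0, 1560.0]]])
```

Pixels where the runs agree get a standard deviation near zero, while pixels where they disagree stand out as less reliable.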

Title: SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

Authors: Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu, Wenjun Wan, Bingbing Liu, AiXue Ye, Hongbo Zhang, Feng Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00409
Pdf URL: https://arxiv.org/pdf/2603.00409
Copy Paste: [[2603.00409]] SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning(https://arxiv.org/abs/2603.00409)
Keywords: generation
Abstract: While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiency in fine-grained structural modeling. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to a global-scale 3D global grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
摘要：虽然多模态大型语言模型 (MLLM) 在语义任务中表现出色，但它们经常缺乏复杂几何推理所必需的“空间感”。当前的模型通常面临过高的模态对齐成本和细粒度结构建模的缺陷，该 http URL 介绍了 SSR，这是一个为结构化场景推理设计的框架，可通过轻量级对齐机制无缝集成 2D 和 3D 表示。为了最大限度地减少训练开销，我们的框架通过跨模态添加和令牌交错将 3D 几何特征锚定到大型语言模型的预对齐 2D 视觉语义，有效地消除了大规模对齐预训练的必要性。为了支持复杂的空间推理，我们提出了一种新颖的场景图生成管道，它将全局布局表示为由相对坐标定义的独立局部三元组链。增量生成算法对此进行了补充，使模型能够为复杂环境构建“语言模型友好”的结构支架。此外，我们将这些功能扩展到全球规模的 3D 全局接地任务，实现跨异构数据源的绝对度量精度。在 7B 参数范围内，SSR 在多个空间智能基准上实现了最先进的性能，特别是在 VSI-Bench 上得分为 73.9。我们的方法明显优于更大的模型，这表明高效的特征对齐和结构化场景推理是真正的空间智能的基石。

Title: Station2Radar: query conditioned gaussian splatting for precipitation field

Authors: Doyi Kim, Minseok Seo, Changick Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00418
Pdf URL: https://arxiv.org/pdf/2603.00418
Copy Paste: [[2603.00418]] Station2Radar: query conditioned gaussian splatting for precipitation field(https://arxiv.org/abs/2603.00418)
Keywords: generation
Abstract: Precipitation forecasting relies on heterogeneous data. Weather radar is accurate, but coverage is geographically limited and costly to maintain. Weather stations provide accurate but sparse point measurements, while satellites offer dense, high-resolution coverage without direct rainfall retrieval. To overcome these limitations, we propose Query-Conditioned Gaussian Splatting (QCGS), the first framework to fuse automatic weather station (AWS) observations with satellite imagery for generating precipitation fields. Unlike conventional 2D Gaussian splatting, which renders the entire image plane, QCGS selectively renders only queried precipitation regions, avoiding unnecessary computation in non-precipitating areas while preserving sharp precipitation structures. The framework combines a radar point proposal network that identifies rainfall-support locations with an implicit neural representation (INR) network that predicts Gaussian parameters for each point. QCGS enables efficient, resolution-flexible precipitation field generation in real time. Through extensive evaluation with benchmark precipitation products, QCGS demonstrates over 50% improvement in RMSE compared to conventional gridded precipitation products, and consistently maintains high performance across multiple spatiotemporal scales.
摘要：降水预报依赖于异构数据。天气雷达很准确，但覆盖范围受地理限制且维护成本高昂。气象站提供准确但稀疏的点测量，而卫星提供密集、高分辨率的覆盖范围，无需直接检索降雨量。为了克服这些限制，我们提出了查询条件高斯分布（QCGS），这是第一个将自动气象站（AWS）观测与卫星图像融合以生成降水场的框架。与渲染整个图像平面的传统二维高斯泼溅不同，QCGS 有选择地仅渲染查询的降水区域，避免在非降水区域进行不必要的计算，同时保留清晰的降水结构。该框架结合了识别降雨支持位置的雷达点提议网络和预测每个点的高斯参数的隐式神经表示（INR）网络。 QCGS 可实时生成高效、分辨率灵活的降水场。通过对基准降水产品的广泛评估，QCGS 与传统网格降水产品相比，RMSE 提高了 50% 以上，并且在多个时空尺度上始终保持高性能。
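Note on the RMSE claim: the abstract reports an improvement of over 50% in RMSE without defining how the percentage is computed. The sketch below assumes the usual reading, a relative reduction of the model's RMSE with respect to a baseline product's RMSE. The function names and the toy numbers are illustrative, not results from the paper.

```python
import math

def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def relative_improvement(baseline_rmse, model_rmse):
    """Fractional RMSE reduction relative to the baseline (0.5 means 50% lower error)."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# Toy rainfall values (mm/h): one reference series and two hypothetical estimates.
reference = [0.0, 2.0, 4.0]
baseline_error = rmse([2.0, 4.0, 6.0], reference)
model_error = rmse([1.0, 3.0, 5.0], reference)
gain = relative_improvement(baseline_error, model_error)
```

Under this reading, a 50% improvement means the model's RMSE is half the baseline's on the same evaluation set.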

Title: An Interpretable Local Editing Model for Counterfactual Medical Image Generation

Authors: Hyungi Min, Taeseung You, Hangyeul Lee, Yeongjae Cho, Sungzoon Cho
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00423
Pdf URL: https://arxiv.org/pdf/2603.00423
Copy Paste: [[2603.00423]] An Interpretable Local Editing Model for Counterfactual Medical Image Generation(https://arxiv.org/abs/2603.00423)
Keywords: generation
Abstract: Counterfactual medical image generation has emerged as a critical tool for enhancing AI-driven systems in the medical domain by answering "what-if" questions. However, existing approaches face two fundamental limitations. First, they fail to prevent unintended modifications, resulting in collateral changes in demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. Through extensive experiments, InstructX2X achieves state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.
摘要：反事实医学图像生成已成为通过回答“假设”问题来增强医学领域人工智能驱动系统的关键工具。然而，现有方法面临两个基本局限性：首先，它们无法防止意外修改，导致人口统计属性发生附带变化，而仅影响疾病特征。其次，它们在编辑过程中缺乏可解释性，这极大地限制了它们在现实世界医疗应用中的效用。为了解决这些限制，我们提出了 InstructX2X，这是一种新颖的可解释本地编辑模型，用于反事实医学图像生成，具有区域特定编辑功能。这种方法限制对特定区域的修改，有效防止意外更改，同时提供指导图，为编辑过程提供本质上可解释的视觉解释。此外，我们还引入了 MIMIC-EDIT-INSTRUCTION，这是一个根据专家验证的医学 VQA 对生成反事实医学图像的数据集。通过大量实验，InstructX2X 在所有主要评估指标上均实现了最先进的性能。我们的模型成功生成了高质量的反事实胸部 X 射线图像以及可解释的解释。

Title: Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models

Authors: April Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00437
Pdf URL: https://arxiv.org/pdf/2603.00437
Copy Paste: [[2603.00437]] Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models(https://arxiv.org/abs/2603.00437)
Keywords: generation
Abstract: Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
摘要：尽管大视觉语言模型（LVLM）已经取得了实质性进展，但幻觉（生成的文本不基于视觉输入）仍然是一个挑战。随着 LVLM 变得更强，之前报告的幻觉模式，例如语言偏见和过度思考现象，变得越来越不一致，使得相应的缓解技术的效果大大降低。在本文中，我们介绍了一种利用层注意力（ICLA）的内部自校正机制，该机制在生成过程中直接对隐藏状态进行操作。每层通过对角跨层注意机制选择性地检索来自所有先前层的信息，从而无需任何外部校正信号即可进行自我细化。通过在 LLaVA1.5-7B 和 Qwen2.5-VL-7B 上引入和训练仅 0.2M 和 0.1M 附加参数，我们的模型持续改善了多个幻觉基准的视觉基础，证明了其对更先进的 LVLM 的有效性。

Title: Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling

Authors: Xueyang Li, Yunzhong Lou, Yu Song, Xiangdong Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00439
Pdf URL: https://arxiv.org/pdf/2603.00439
Copy Paste: [[2603.00439]] Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling(https://arxiv.org/abs/2603.00439)
Keywords: generation, generative
Abstract: Computer-Aided Design (CAD) generative modeling has strong, long-standing applications in industry. Recently, the parametric CAD sequence, which captures the design logic of an object, has been widely mined by sequence models. However, industrial CAD models, especially component objects, are fine-grained and complex, requiring a longer parametric CAD sequence to define. To address this problem, we introduce Mamba-CAD, a self-supervised generative modeling approach for complex industrial CAD models that can model longer parametric CAD sequences. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models. We then use the learned representation to guide a generative adversarial network to produce fake representations of CAD models, which are finally recovered into parametric CAD sequences via the decoder of Mamba-CAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset can be obtained from this https URL.
摘要：计算机辅助设计（CAD）生成建模在业界有着强大而长期的应用。近年来，参数化CAD序列作为对象的设计逻辑已被序列模型广泛挖掘。然而，工业 CAD 模型（尤其是组件对象）细粒度且复杂，需要更长的参数化 CAD 序列来定义。为了解决这个问题，我们引入了 Mamba-CAD，这是一种针对行业中复杂 CAD 模型的自监督生成建模，它可以对更长的参数化 CAD 序列进行建模。具体来说，我们首先设计一个基于 Mamba 架构的编码器-解码器框架，并将其与 CAD 重建任务配对进行预训练，以对 CAD 模型的潜在表示进行建模；然后，我们利用学习到的表示来指导生成对抗网络生成 CAD 模型的假表示，最终通过 MambaCAD 解码器将其恢复为参数化 CAD 序列。为了训练 Mamba-CAD，我们进一步创建了一个新数据集，其中包含 77,078 个 CAD 模型以及更长的参数化 CAD 序列。进行了全面的实验来证明我们的模型在各种评估指标下的有效性，特别是在有效参数化 CAD 序列的生成长度方面。代码和数据集可以从此 https URL 获取。

Title: SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

Authors: Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui, Anyi Rao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00443
Pdf URL: https://arxiv.org/pdf/2603.00443
Copy Paste: [[2603.00443]] SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment(https://arxiv.org/abs/2603.00443)
Keywords: generation, generative
Abstract: Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data in improving estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives for generating diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by a Vision-Language Model. These semantics suppress human-irrelevant environmental details and ensure sufficient human-centric context for hand image generation. For structural alignment, we introduce hierarchical structural fusion, which integrates structural information at different granularities for feature refinement, to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
摘要：最近关于 3D 手部重建的研究证明了合成训练数据对于提高估计性能的有效性。然而，大多数方法依赖游戏引擎来合成手部图像，这些图像通常缺乏纹理和环境的多样性，并且无法包含手臂或交互对象等关键组件。生成模型是生成不同手部图像的有前途的替代方案，但仍然存在未对准问题。在本文中，我们提出了 SesaHand，它从语义和结构对齐角度增强了可控手部图像生成，以进行 3D 手部重建。具体来说，对于语义对齐，我们提出了一种具有思想链推理的管道，用于从视觉语言模型生成的图像标题中提取人类行为语义。这种语义抑制了与人类无关的环境细节，并确保有足够的以人类为中心的上下文来生成手部图像。对于结构对齐，我们引入分层结构融合来集成不同粒度的结构信息以进行特征细化，以更好地对齐生成图像中的手和整体人体。我们进一步提出了一种手部结构注意力增强方法，以有效增强模型对手部区域的注意力。实验表明，我们的方法不仅在生成性能方面优于先前的工作，而且还改进了生成的手部图像的 3D 手部重建。

Title: Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

Authors: Xi Wang, Wenbo Lu, Shengjie Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00454
Pdf URL: https://arxiv.org/pdf/2603.00454
Copy Paste: [[2603.00454]] Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training(https://arxiv.org/abs/2603.00454)
Keywords: generation, generative
Abstract: Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.
摘要：生成流网络（GFlowNets）能够微调大型语言模型以近似奖励比例后验，但它们仍然容易出现模式崩溃，表现为前缀崩溃和长度偏差。我们将此归因于两个因素：（i）早期前缀的信用分配较弱，以及（ii）有偏见的重播会导致训练流分布发生变化，不具代表性。我们提出了根吸收前缀轨迹平衡 RapTB，该目标将子轨迹监督锚定在根，并通过基于吸收后缀的备份将终端奖励传播到中间前缀，从而提供密集的前缀级学习信号。为了减轻重放引起的分布变化，我们进一步引入了 SubM，这是一种子模块重放刷新策略，可促进高奖励和多样性。根据经验，在诸如使用 SMILES 字符串通过 LLM 生成分子等任务中，RapTB 与 SubM 相结合不断提高优化性能和分子多样性，同时保持高有效性。
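
For orientation, the sketch below shows the standard trajectory balance (TB) objective that RapTB builds on. It is not the RapTB objective itself, whose rooted prefix supervision and absorbed suffix backups are not specified in the abstract. The function name and argument layout are assumptions for illustration. For one complete trajectory, TB penalizes the squared difference between log Z plus the summed forward log-probabilities and log R plus the summed backward log-probabilities.

```python
import math

def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, reward):
    """Standard TB loss for one trajectory:
    (log Z + sum log P_F - log R - sum log P_B) ** 2.
    """
    forward = log_z + sum(log_pf_steps)
    backward = math.log(reward) + sum(log_pb_steps)
    return (forward - backward) ** 2

# Toy trajectory: two forward steps with probability 0.5 each, a
# deterministic backward policy (log-prob 0), Z = 2, and terminal reward 0.5.
loss = trajectory_balance_loss(
    log_z=math.log(2.0),
    log_pf_steps=[math.log(0.5), math.log(0.5)],
    log_pb_steps=[0.0, 0.0],
    reward=0.5,
)
```

In this toy case the forward flow Z * 0.5 * 0.5 equals the reward, so the loss is zero up to floating-point error. Note that the signal arrives only at the end of the trajectory, which is the weak credit assignment to early prefixes that the abstract describes.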

Title: Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution

Authors: Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00458
Pdf URL: https://arxiv.org/pdf/2603.00458
Copy Paste: [[2603.00458]] Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution(https://arxiv.org/abs/2603.00458)
Keywords: super-resolution, generation
Abstract: While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path by pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces parameter count by 95% and achieves an 8× acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.
摘要：虽然许多扩散模型通过生成丰富而真实的细节在现实视频超分辨率（Real-VSR）中取得了令人印象深刻的结果，但它们对多步采样的依赖导致推理速度缓慢。像 SeedVR2、DOVE 和 DLoRAL 这样的单步网络通过将生成压缩为一步来缓解这一问题，但它们仍然很重，具有数十亿个参数和数秒的延迟。最近的对抗性扩散压缩（ADC）通过将这些模型修剪和提炼成紧凑的 AdcSR 网络提供了一条有前途的路径，但由于缺乏时间意识和标准对抗性学习的局限性，直接将其应用于 Real-VSR 无法平衡空间细节和时间一致性。 To address these challenges, we propose an improved ADC method for Real-VSR.我们的方法将配备 3D 时空注意力的大型扩散 Transformer (DiT) 教师 DOVE 提炼为经过修剪的基于 2D 稳定扩散 (SD) 的 AdcSR 主干网，并通过轻量级 1D 时间卷积进行增强，从而实现了显着更高的效率。此外，我们引入了一种双头对抗性蒸馏方案，其中像素域和特征域中的判别器将细节和一致性的判别明确地分解为两个头，从而使两个目标都能得到有效优化，而无需牺牲一个目标。实验表明，所得到的压缩 AdcVSR 模型将参数复杂度降低了 95%，并比 DiT 老师 DOVE 实现了 8 倍加速，同时保持了有竞争力的视频质量和效率。

Title: ReMoT: Reinforcement Learning with Motion Contrast Triplets

Authors: Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00461
Pdf URL: https://arxiv.org/pdf/2603.00461
Copy Paste: [[2603.00461]] ReMoT: Reinforcement Learning with Motion Contrast Triplets(https://arxiv.org/abs/2603.00461)
Keywords: generation
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation; and (2) Group Relative Policy Optimization, which we empirically find yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
摘要：我们提出了 ReMoT，这是一种统一的训练范式，旨在系统地解决 VLM 在时空一致性方面的根本缺陷——这是导航、机器人和自动驾驶领域的一个关键故障点。 ReMoT 集成了两个核心组件：（1）基于规则的自动框架，可生成 ReMoT-16K，这是一个源自视频元注释的大规模（16.5K 三元组）运动对比度数据集，超越了昂贵的手动或基于模型的生成。 (2) 组相对策略优化，我们凭经验验证它可以为学习这种对比推理提供最佳性能和数据效率，远远超过标准的监督微调。我们还构建了细粒度运动对比度三元组的第一个基准，以测量 VLM 对细微运动属性（例如相反方向）的辨别力。由此产生的模型在我们的新基准和多个标准 VLM 基准上实现了最先进的性能，最终在时空推理任务上实现了 25.1% 的显着性能飞跃。
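
As a reminder of what "group relative" means in Group Relative Policy Optimization, the sketch below computes the advantage of each sampled response by normalizing its reward against the other responses drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration. The abstract does not describe ReMoT's reward design or training loop.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical responses to one motion-contrast question,
# rewarded 1.0 when the answer is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, with no separate value network required.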

Title: DreamWorld: Unified World Modeling in Video Generation

Authors: Boming Tan, Xiangdong Zhang, Ning Liao, Yuqing Zhang, Shaofeng Zhang, Xue Yang, Qi Fan, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00466
Pdf URL: https://arxiv.org/pdf/2603.00466
Copy Paste: [[2603.00466]] DreamWorld: Unified World Modeling in Video Generation(https://arxiv.org/abs/2603.00466)
Keywords: generation
Abstract: Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA) to progressively regulate world-level constraints during training, and Multi-Source Inner-Guidance to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at this https URL (GitHub).
摘要：尽管视频生成方面取得了令人印象深刻的进步，但现有模型仍然仅限于表面的合理性，缺乏对世界的连贯和统一的理解。先前的方法通常仅包含单一形式的与世界相关的知识，或者依赖严格的对齐策略来引入额外的知识。然而，对齐单一世界知识不足以构建需要联合建模多个异构维度（例如物理常识、3D 和时间一致性）的世界模型。为了解决这个限制，我们引入了 \textbf{DreamWorld}，一个统一的框架，通过 \textbf{联合世界建模范式} 将互补的世界知识集成到视频生成器中，联合预测基础模型中的视频像素和特征，以捕获时间动态、空间几何和语义一致性。 However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering.为了缓解这个问题，我们建议 \textit{一致约束退火（CCA）} 在训练期间逐步调节世界级约束，并提出 \textit{多源内部指导} 在推理时强制执行学习的世界先验。 Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at \href{this https URL}{\textcolor{mypink}{\textbf{Github}}}.

Title: U-VLM: Hierarchical Vision Language Modeling for Report Generation

Authors: Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00479
Pdf URL: https://arxiv.org/pdf/2603.00479
Copy Paste: [[2603.00479]] U-VLM: Hierarchical Vision Language Modeling for Report Generation(https://arxiv.org/abs/2603.00479)
Keywords: generation
Abstract: Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at this https URL.
摘要：自动生成放射学报告是减少放射科医生工作量和提高诊断一致性的关键，但生成准确的 3D 医学成像报告仍然具有挑战性。现有的视觉语言模型面临两个限制：它们不利用分段预训练编码器，并且仅在语言模型的输入层注入视觉特征，从而丢失多尺度信息。我们提出了 U-VLM，它可以在训练和架构中实现分层视觉语言建模：(1) 从分割到分类再到报告生成的渐进训练，以及 (2) 多层视觉注入，将 U-Net 编码器特征路由到相应的语言模型层。每个训练阶段都可以利用不同的数据集，而无需统一注释。仅使用从头开始训练的 0.1B 解码器，U-VLM 在 CT-RATE（F1：0.414 vs 0.258，BLEU-mean：0.349 vs 0.305）和 AbdomenAtlas 3.0（F1：0.624 vs 0.518，用于基于分割的检测）上实现了最先进的性能，这证明了精心设计的视觉编码器预训练的优势7B+ 预训练语言模型的好处。消融研究表明，渐进式预训练显着提高了 F1，而多层注入则提高了 BLEU 平均值。代码可从此 https URL 获取。

Title: RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Authors: Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00483
Pdf URL: https://arxiv.org/pdf/2603.00483
Copy Paste: [[2603.00483]] RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment(https://arxiv.org/abs/2603.00483)
Keywords: generation
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while requiring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at this https URL.
摘要：最近的文本到图像（T2I）扩散模型实现了显着的真实感，但忠实的提示图像对齐仍然具有挑战性，特别是对于具有多个对象、关系和细粒度属性的复杂提示。现有的免训练推理时间缩放方法依赖于固定的迭代预算，无法适应即时困难，而反射调整模型需要精心策划的反射数据集以及扩散和视觉语言模型的广泛联合微调，通常会过度拟合反射路径数据，并且缺乏跨模型的可迁移性。我们引入了 RAISE（需求自适应自我改进进化），这是一种无需训练、需求驱动的自适应 T2I 生成进化框架。 RAISE 将图像生成制定为需求驱动的自适应缩放过程，通过一系列不同的细化操作（包括提示重写、噪声重采样和指令编辑）在推理时进化候选群体。每一代都根据结构化的需求清单进行验证，使系统能够动态识别不满足的项目，并仅在需要时分配进一步的计算。这实现了自适应测试时间缩放，使计算工作量与语义查询复杂性保持一致。在 GenEval 和 DrawBench 上，RAISE 实现了最先进的对齐（总体 GenEval 为 0.94），同时比之前的缩放和反射调整基线产生更少的生成样本（减少了 30-40%）和 VLM 调用（减少了 80%），展示了高效、可泛化和与模型无关的多轮自我改进。代码可从此 https URL 获取。
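
The checklist-driven loop described above can be sketched as follows. This is a deliberately simplified illustration and not the RAISE algorithm: it tracks a single candidate instead of a population, and `generate`, `verify`, and `refine` are hypothetical callables standing in for the image generator, the VLM verifier, and the refinement actions. What it shows is the adaptive budget: computation stops as soon as every requirement is satisfied, so easy prompts use fewer rounds than hard ones.

```python
def adaptive_refine(requirements, generate, verify, refine, max_rounds=5):
    """Refine a candidate until all requirements pass or the budget runs out.

    Returns (candidate, unmet_requirements, rounds_used).
    """
    candidate = generate()
    rounds_used = 0
    while True:
        unmet = [r for r in requirements if not verify(candidate, r)]
        if not unmet or rounds_used >= max_rounds:
            return candidate, unmet, rounds_used
        candidate = refine(candidate, unmet)
        rounds_used += 1

# Toy stand-ins: a "candidate" is just the set of requirements it satisfies,
# and each refinement fixes the first unmet requirement.
requirements = ["a red cube", "a blue sphere", "cube left of sphere"]
result, unmet, rounds = adaptive_refine(
    requirements,
    generate=lambda: {"a red cube"},
    verify=lambda cand, req: req in cand,
    refine=lambda cand, unmet_reqs: cand | {unmet_reqs[0]},
)
```

Here the initial candidate already satisfies one of three requirements, so only two refinement rounds are spent; a prompt satisfied on the first try would spend none.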

Title: ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models

Authors: Riccardo de Lutio, Tobias Fischer, Yen-Yu Chang, Yuxuan Zhang, Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Katarina Tothova, Zan Gojcic, Haithem Turki
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00492
Pdf URL: https://arxiv.org/pdf/2603.00492
Copy Paste: [[2603.00492]] ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models(https://arxiv.org/abs/2603.00492)
Keywords: generative
Abstract: Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.
摘要：3D Gaussian Splatting 等按场景优化方法提供了最先进的新颖视图合成质量，但对于观察不足的区域的推断效果较差。利用生成先验来纠正这些领域的伪影的方法很有希望，但目前存在两个缺点。首先是可扩展性，因为现有方法使用图像扩散模型或双向视频模型，这些模型在单次传递中可以生成的视图数量受到限制（因此需要昂贵的迭代蒸馏过程来保证一致性）。第二个是质量本身，因为先前工作中使用的生成器往往会产生与现有场景内容不一致的输出，并且在完全未观察到的区域中完全失败。为了解决这些问题，我们提出了一个利用两个关键见解的两阶段管道。首先，我们使用新颖的不透明度混合策略训练一个强大的双向生成模型，该策略鼓励与现有观察结果的一致性，同时保留模型在未见区域推断新颖内容的能力。其次，我们将其提炼成因果自回归模型，该模型一次生成数百个帧。该模型可以直接产生新颖的视图或充当伪监督，以简单高效的方式改进底层 3D 表示。我们广泛评估我们的方法，并证明它可以在现有方法完全失败的情况下生成合理的重建。当在通用基准数据集上进行测量时，我们大幅优于现有的所有现有基线，比先前最先进的方法高出 1-3 dB PSNR。

Title: Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning

Authors: Ruoshuang Du, Xin Sun, Qiang Liu, Bowen Song, Zhongqi Chen, Weiqiang Wang, Liang Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00511
Pdf URL: https://arxiv.org/pdf/2603.00511
Copy Paste: [[2603.00511]] Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning(https://arxiv.org/abs/2603.00511)
Keywords: generation
Abstract: Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrated that the model achieves a significant improvement in response performance in three VQA datasets. Meanwhile, ablation studies highlighted the importance of internal representations in adaptive retrieval decisions. In general, the experimental results demonstrated that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
摘要：视觉问答系统因幻觉而面临可靠性问题，模型生成的答案与视觉输入或事实知识不一致。虽然检索增强生成框架通过合并外部知识来缓解这个问题，但静态检索通常会引入不相关或冲突的内容，特别是在视觉 RAG 设置中，在视觉 RAG 设置中可能会检索到视觉上相似但语义上不正确的证据。为了解决这个问题，我们提出了多模态自适应 RAG（MMA-RAG），它动态评估模型内部知识的置信度，以决定是否将检索到的外部信息合并到生成过程中。 MMA-RAG 的核心是通过分层分析训练的决策分类器，它利用联合内部视觉和文本表示来指导反向图像检索的使用。实验表明，该模型在三个 VQA 数据集中的响应性能取得了显着提高。同时，消融研究强调了内部表征在自适应检索决策中的重要性。总的来说，实验结果表明，MMA-RAG 在不同的多模态场景中有效地平衡了外部知识利用和推理鲁棒性。

Title: Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Authors: Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00518
Pdf URL: https://arxiv.org/pdf/2603.00518
Copy Paste: [[2603.00518]] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training(https://arxiv.org/abs/2603.00518)
Keywords: generation
Abstract: Learning efficient and expressive visual representations has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address this challenge, we introduce a new linear-time sequence modeling method, Test-Time Training (TTT), into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating a bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification, respectively, and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, Vittt-T reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
摘要：学习高效且富有表现力的视觉表示一直是计算机视觉研究的追求。虽然视觉变换器（ViT）逐渐取代传统的卷积神经网络（CNN）作为更具可扩展性的视觉学习器，但它们的应用受到自注意力机制的二次复杂度的困扰。为了应对这一挑战，我们在视觉中引入了一种新的线性时间序列建模方法测试时间训练（TTT），并提出了 Vision-TTT，它以一种新颖的自监督学习方式压缩视觉标记序列。通过结合双向扫描策略和 Conv2d 模块，Vision-TTT 有效地扩展了普通 TTT，以模拟与全局感受野的 2D 视觉相关性。大量实验表明，\texttt{Vittt-T/S/B} 在 ImageNet 分类上实现了 77.3%、81.2%、82.5% Top-1 准确率，并且在下游任务上也大大优于同类产品。在 1280x1280 分辨率下，\texttt{Vittt-T} 比 DeiT-T 减少了 79.4% 的 FLOP，运行速度提高了 4.38 倍，内存减少了 88.9%。这些结果证明了 Vision-TTT 作为下一代通用视觉主干的有力候选者的表现力和效率。

Title: Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness

Authors: Yuyang Chen, Linqian Zeng, Yijin ZHou, Hengjie Li, Jidong Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00519
Pdf URL: https://arxiv.org/pdf/2603.00519
Copy Paste: [[2603.00519]] Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness(https://arxiv.org/abs/2603.00519)
Keywords: generation, generative
Abstract: Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at this https URL.
摘要：扩散模型在生成人工智能领域取得了巨大的成功，但其计算效率仍然是一个重大挑战，特别是对于需要密集全注意力计算的扩散变压器（DiT）而言。虽然现有的加速方法侧重于与内容无关的统一优化策略，但我们观察到生成内容中的不同区域在去噪过程中表现出异构的收敛模式。我们推出了 Jano，这是一个免培训的框架，它利用这种洞察力来实现高效的区域感知生成。 Jano 引入了一种早期复杂性识别算法，可在初始去噪步骤中准确识别区域收敛要求，并结合可优化计算资源分配的自适应令牌调度运行时。通过对最先进模型的综合评估，Jano 在保持生成质量的同时实现了大幅加速（平均加速 2.0 倍，最高可达 2.4 倍）。我们的工作挑战了传统的统一处理假设，并为加速大规模内容生成提供了实用的解决方案。我们的实现的源代码可以在此 https URL 中找到。

Title: Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

Authors: Zhen Zhou, Jian Liu, Biwen Lei, Jing Xu, Haohan Weng, Yiling Zhu, Zhuo Chen, Junfeng Fan, Yunkai Ma, Dazhao Du, Song Guo, Fengshui Jing, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00526
Pdf URL: https://arxiv.org/pdf/2603.00526
Copy Paste: [[2603.00526]] Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation(https://arxiv.org/abs/2603.00526)
Keywords: generation
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO), which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored to efficient post-training for 3D mesh generation, which is 3.75× faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.

Title: RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation

Authors: Xianhao Zhou, Jianghao Wu, Lanfeng Zhong, Ku Zhao, Jinlong He, Shaoting Zhang, Guotai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00535
Pdf URL: https://arxiv.org/pdf/2603.00535
Copy Paste: [[2603.00535]] RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation(https://arxiv.org/abs/2603.00535)
Keywords: generation
Abstract: Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT-CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at this https URL.

Title: Spectral Condition for $μ$P under Width-Depth Scaling

Authors: Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, Chongxuan Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.00541
Pdf URL: https://arxiv.org/pdf/2603.00541
Copy Paste: [[2603.00541]] Spectral Condition for $μ$P under Width-Depth Scaling(https://arxiv.org/abs/2603.00541)
Keywords: generative
Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $\mu$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $\mu$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
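As a rough illustration of what a spectral condition looks like in practice, the sketch below checks the width-only condition known from prior work on $\mu$P, namely that a layer's spectral norm should be of order $\sqrt{n_\text{out}/n_\text{in}}$. The depth-dependent factors introduced in this paper are not stated in the abstract and are not modeled here; the function names `spectral_init` and `spectral_ratio` are hypothetical.

```python
import numpy as np

def spectral_init(fan_out, fan_in, rng):
    """Gaussian init whose spectral norm is Theta(sqrt(fan_out / fan_in)).

    Assumption: width-only scaling; any depth factor is omitted.
    """
    # A Gaussian matrix with std sigma has spectral norm close to
    # sigma * (sqrt(fan_in) + sqrt(fan_out)), so this sigma keeps the
    # norm within a constant factor of sqrt(fan_out / fan_in).
    sigma = min(1.0, np.sqrt(fan_out / fan_in)) / np.sqrt(fan_in)
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))

def spectral_ratio(W):
    """Ratio of the measured spectral norm to the target sqrt(fan_out / fan_in)."""
    fan_out, fan_in = W.shape
    return np.linalg.norm(W, ord=2) / np.sqrt(fan_out / fan_in)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for shape in [(256, 64), (64, 256), (128, 128)]:
        W = spectral_init(*shape, rng)
        # The ratio should stay O(1) as the widths change.
        print(shape, round(float(spectral_ratio(W)), 2))
```

If the ratio stays bounded as the widths grow, the initialization satisfies the width-only condition; the same check could be applied to per-step weight updates.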

Title: WildActor: Unconstrained Identity-Preserving Video Generation

Authors: Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00586
Pdf URL: https://arxiv.org/pdf/2603.00586
Copy Paste: [[2603.00586]] WildActor: Unconstrained Identity-Preserving Video Generation(https://arxiv.org/abs/2603.00586)
Keywords: generation
Abstract: Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.

Title: AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

Authors: Cencen Liu (1), Dongyang Zhang (1 and 2), Wen Yin (1), Jielei Wang (1 and 2), Tianyu Li (1), Ji Guo (1), Wenbo Jiang (1), Guoqing Wang (1), Guoming Lu (1 and 2) ((1) University of Electronic Science and Technology of China, (2) Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00589
Pdf URL: https://arxiv.org/pdf/2603.00589
Copy Paste: [[2603.00589]] AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution(https://arxiv.org/abs/2603.00589)
Keywords: super-resolution, generation, generative
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales. Both severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.

Title: IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

Authors: Honghao Cai, Xiangyuan Wang, Yunhao Bai, Tianze Zhou, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00607
Pdf URL: https://arxiv.org/pdf/2603.00607
Copy Paste: [[2603.00607]] IdGlow: Dynamic Identity Modulation for Multi-Subject Generation(https://arxiv.org/abs/2603.00607)
Keywords: generation, generative
Abstract: Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks (direct multi-person fusion and age-transformed group generation) demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

Title: Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

Authors: Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00643
Pdf URL: https://arxiv.org/pdf/2603.00643
Copy Paste: [[2603.00643]] Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered(https://arxiv.org/abs/2603.00643)
Keywords: restoration, generative, quality assessment
Abstract: This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models' outcomes.

Title: Direct low-field MRI super-resolution using undersampled k-space

Authors: Daniel Tweneboah Anyimadu, Mohammed M. Abdelsamea, Ahmed Karam Eldaly
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00668
Pdf URL: https://arxiv.org/pdf/2603.00668
Copy Paste: [[2603.00668]] Direct low-field MRI super-resolution using undersampled k-space(https://arxiv.org/abs/2603.00668)
Keywords: super-resolution
Abstract: Low-field magnetic resonance imaging (MRI) provides affordable access to diagnostic imaging but suffers from prolonged acquisition and limited image quality. Accelerated imaging can be achieved with k-space undersampling, while super-resolution (SR) and image quality transfer (IQT) methods typically rely on spatial-domain post-processing. In this work, we propose a novel framework for reconstructing high-field-like MR images directly from undersampled low-field k-space. Our approach employs a k-space dual-channel U-Net that processes the real and imaginary components of undersampled k-space to restore missing frequency content. Experiments on low-field brain MRI demonstrate that our k-space-driven image enhancement consistently outperforms the counterpart spatial-domain method. Furthermore, reconstructions from undersampled k-space achieve image quality comparable to full k-space acquisitions. To the best of our knowledge, this is the first work that investigates low-field MRI SR/IQT directly from undersampled k-space.
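To make the input representation concrete, the following minimal sketch shows how an image can be converted to undersampled k-space and split into real and imaginary channels, which is the kind of two-channel tensor a dual-channel network would consume. The Cartesian column mask, the fraction of retained central columns, and the function names are assumptions for illustration; the abstract does not specify the sampling pattern or network details.

```python
import numpy as np

def undersample_kspace(image, keep_fraction=0.25, seed=0):
    """Return a (2, H, W) real/imag k-space array and the column mask used.

    Assumption: Cartesian undersampling that always keeps a small band of
    central (low-frequency) columns plus randomly chosen extra columns.
    """
    rng = np.random.default_rng(seed)
    kspace = np.fft.fftshift(np.fft.fft2(image))  # centered k-space
    _, w = kspace.shape
    mask = np.zeros(w, dtype=bool)
    n_center = max(1, int(0.08 * w))
    start = w // 2 - n_center // 2
    mask[start:start + n_center] = True
    n_keep = max(n_center, int(keep_fraction * w))
    remaining = np.flatnonzero(~mask)
    extra = rng.choice(remaining, size=n_keep - n_center, replace=False)
    mask[extra] = True
    under = kspace * mask[None, :]  # zero out unsampled columns
    return np.stack([under.real, under.imag], axis=0), mask

def zero_filled_recon(channels):
    """Magnitude image from two-channel k-space via the inverse FFT."""
    kspace = channels[0] + 1j * channels[1]
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))

if __name__ == "__main__":
    img = np.random.default_rng(1).random((64, 64))
    channels, mask = undersample_kspace(img, keep_fraction=0.25)
    print(channels.shape, int(mask.sum()))
```

A network operating in k-space would take `channels` as input and predict the missing frequency content; `zero_filled_recon` gives the naive baseline image for comparison.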

Title: SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion

Authors: Guoquan Wei, Liu Shi, Shaoyu Wang, Mohan Li, Cunfeng Wei, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00687
Pdf URL: https://arxiv.org/pdf/2603.00687
Copy Paste: [[2603.00687]] SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion(https://arxiv.org/abs/2603.00687)
Keywords: generation
Abstract: Noise and artifacts during computed tomography (CT) scans are a fundamental challenge affecting disease diagnosis. However, current methods either involve excessively long reconstruction times or rely on data-driven models for optimization, failing to adequately consider the valuable information inherent in the data itself, especially medical 3D data. This work proposes a reconstruction method under ultra-low raw data conditions, requiring no external data and avoiding lengthy pre-training processes. By leveraging spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, high-fidelity results can be achieved in a very short time. Extensive experiments demonstrate that this method not only mitigates detector-induced ring artifacts but also exhibits unprecedented capabilities in detail recovery. This method provides a new paradigm for research using unlabeled raw projection data. Code is available at this https URL.

Title: Diversity over Uniformity: Rethinking Representation in Generated Image Detection

Authors: Qinghui He, Haifeng Zhang, Qiao Qin, Bo Liu, Xiuli Bi, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00717
Pdf URL: https://arxiv.org/pdf/2603.00717
Copy Paste: [[2603.00717]] Diversity over Uniformity: Rethinking Representation in Generated Image Detection(https://arxiv.org/abs/2603.00717)
Keywords: generative
Abstract: With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at this https URL.

Title: General Proximal Flow Networks

Authors: Alexander Strunk, Roland Assam
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00751
Pdf URL: https://arxiv.org/pdf/2603.00751
Copy Paste: [[2603.00751]] General Proximal Flow Networks(https://arxiv.org/abs/2603.00751)
Keywords: generation, generative
Abstract: This paper introduces General Proximal Flow Networks (GPFNs), a generalization of Bayesian Flow Networks that broadens the class of admissible belief-update operators. In Bayesian Flow Networks, each update step is a Bayesian posterior update, which is equivalent to a proximal step with respect to the Kullback-Leibler divergence. GPFNs replace this fixed choice with an arbitrary divergence or distance function, such as the Wasserstein distance, yielding a unified proximal-operator framework for iterative generative modeling. The corresponding training and sampling procedures are derived, establishing a formal link to proximal optimization and recovering the standard BFN update as a special case. Empirical evaluations confirm that adapting the divergence to the underlying data geometry yields measurable improvements in generation quality, highlighting the practical benefits of this broader framework.
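The equivalence between a Bayesian posterior update and a KL-proximal step mentioned in the abstract follows from a standard variational identity, sketched below. Here $p$ is the current belief, $\ell(\theta) = p(y \mid \theta)$ is the likelihood of the new observation, and $q^\star$ is the posterior. The final line is only one plausible reading of the generalized update: the divergence $D$ and the step size $\lambda$ are assumed notation, not taken from the paper.

```latex
% Bayes' rule as a KL-proximal step (standard variational identity).
\begin{align}
q^\star(\theta) &= \frac{p(\theta)\,\ell(\theta)}{Z},
  \qquad Z = \int p(\theta)\,\ell(\theta)\,d\theta \\
\mathbb{E}_{q}[-\log \ell(\theta)] + \mathrm{KL}(q \,\|\, p)
  &= \mathbb{E}_{q}\!\left[\log \frac{q(\theta)}{p(\theta)\,\ell(\theta)}\right]
   = \mathrm{KL}(q \,\|\, q^\star) - \log Z \\
\arg\min_{q}\Big\{ \mathbb{E}_{q}[-\log \ell(\theta)] + \mathrm{KL}(q \,\|\, p) \Big\}
  &= q^\star \\
% Assumed generalized form: swap KL for a divergence or distance D.
q^{+} &= \arg\min_{q}\Big\{ \mathbb{E}_{q}[-\log \ell(\theta)]
  + \tfrac{1}{\lambda}\, D(q, p) \Big\}
\end{align}
```

Since $\log Z$ does not depend on $q$ and $\mathrm{KL}(q \,\|\, q^\star) \ge 0$ with equality only at $q = q^\star$, the minimizer of the KL-regularized objective is exactly the Bayesian posterior.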

Title: Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning

Authors: Karanpartap Singh, Adam Turnbull, Mohammad Abbasi, Kilian Pohl, Feng Vankee Lin, Ehsan Adeli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00786
Pdf URL: https://arxiv.org/pdf/2603.00786
Copy Paste: [[2603.00786]] Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning(https://arxiv.org/abs/2603.00786)
Keywords: generation
Abstract: Understanding how large-scale functional brain networks reorganize during cognitive decline remains a central challenge in neuroimaging. While recent self-supervised models have shown promise for learning representations from resting-state fMRI, their internal mechanisms are difficult to interpret, limiting mechanistic insight. We propose BrainInterNet, a network-aware self-supervised framework based on masked reconstruction with cross-attention that explicitly models inter-network dependencies in rs-fMRI. By selectively masking predefined functional networks and reconstructing them from remaining context, our approach enables direct quantification of network predictability and interpretable analysis of cross-network interactions. We train BrainInterNet on multi-cohort fMRI data (from the ABCD, HCP Development, HCP Young Adults, and HCP Aging datasets) and evaluate on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, in total comprising 5,582 recordings. Our method reveals systematic alterations in the brain's network interactions under AD, including in the default mode, limbic, and attention networks. In parallel, the learned representations support accurate Alzheimer's-spectrum classification and yield a compact summary marker that tracks disease severity longitudinally. Together, these results demonstrate that network-guided masked modeling with cross-attention provides an interpretable and effective framework for characterizing functional reorganization in neurodegeneration.

Title: COMBAT: Conditional World Models for Behavioral Agent Training

Authors: Anmol Agarwal, Pranay Meshram, Sumer Singh, Saurav Suman, Andrew Lapp, Shahbuland Matiana, Louis Castricato, Spencer Frazier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00825
Pdf URL: https://arxiv.org/pdf/2603.00825
Copy Paste: [[2603.00825]] COMBAT: Conditional World Models for Behavioral Agent Training(https://arxiv.org/abs/2603.00825)
Keywords: generation
Abstract: Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.

Title: Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration and Enhancement

Authors: Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang, Yang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00853
Pdf URL: https://arxiv.org/pdf/2603.00853
Copy Paste: [[2603.00853]] Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration and Enhancement(https://arxiv.org/abs/2603.00853)
Keywords: restoration, super-resolution
Abstract: We propose UHDPromer, a simple yet effective neural discrimination-prompted Transformer for Ultra-High-Definition (UHD) image restoration and enhancement. UHDPromer is inspired by the observation that neural differences implicitly exist between high-resolution and low-resolution features, and that exploring such differences can facilitate low-resolution feature representation. To this end, we first introduce Neural Discrimination Priors (NDP) to measure the differences and then integrate NDP into the proposed Neural Discrimination-Prompted Attention (NDPA) and Neural Discrimination-Prompted Network (NDPN). The proposed NDPA reformulates attention by incorporating NDP to globally perceive useful discrimination information, while the NDPN explores a continuous gating mechanism guided by NDP to selectively permit the passage of beneficial content. To enhance the quality of restored images, we propose a super-resolution-guided reconstruction approach, which super-resolves low-resolution features to facilitate final UHD image restoration. Experiments show that UHDPromer achieves the best computational efficiency while still maintaining state-of-the-art performance on three UHD image restoration and enhancement tasks: low-light image enhancement, image dehazing, and image deblurring. The source code and pre-trained models will be made available at this https URL.

Title: Active Flow Matching

Authors: Yashvir S. Grewal, Daniel M. Steinberg, Thang D. Bui, Cheng Soon Ong, Edwin V. Bonilla
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00877
Pdf URL: https://arxiv.org/pdf/2603.00877
Copy Paste: [[2603.00877]] Active Flow Matching(https://arxiv.org/abs/2603.00877)
Keywords: generative
Abstract: Discrete diffusion and flow matching models capture complex, non-additive and non-autoregressive structure in high-dimensional objective landscapes through parallel, iterative refinement. However, their implicit generative nature precludes direct integration with principled variational frameworks for online black-box optimisation, such as variational search distributions (VSD) and conditioning by adaptive sampling (CbAS). We introduce Active Flow Matching (AFM), which reformulates variational objectives to operate on conditional endpoint distributions along the flow, enabling gradient-based steering of flow models toward high-fitness regions while preserving the rigour of VSD and CbAS. We derive forward and reverse Kullback-Leibler (KL) variants using self-normalised importance sampling. Across a suite of online protein and small molecule design tasks, forward-KL AFM consistently performs competitively compared to state-of-the-art baselines, demonstrating effective exploration-exploitation under tight experimental budgets.
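The abstract relies on self-normalised importance sampling to derive its KL variants. The sketch below shows only that generic estimator, not the AFM objective itself: samples drawn from a proposal are reweighted by an unnormalised target density, and the weights are normalised to sum to one. The function name and arguments are hypothetical.

```python
import numpy as np

def snis_expectation(f_values, log_target, log_proposal):
    """Self-normalised importance sampling estimate of E_target[f].

    f_values     : f evaluated at samples drawn from the proposal
    log_target   : unnormalised log-density of the target at those samples
    log_proposal : log-density of the proposal at those samples
    Returns the estimate and the normalised weights.
    """
    log_w = np.asarray(log_target, dtype=float) - np.asarray(log_proposal, dtype=float)
    log_w = log_w - log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    w = w / w.sum()              # self-normalisation removes unknown constants
    return float(np.sum(w * np.asarray(f_values, dtype=float))), w

if __name__ == "__main__":
    # Toy check: three points under a uniform proposal and an
    # unnormalised target proportional to [1, 2, 1].
    x = np.array([0.0, 1.0, 2.0])
    est, weights = snis_expectation(
        x, np.log([1.0, 2.0, 1.0]), np.log([1 / 3, 1 / 3, 1 / 3])
    )
    print(est, weights)
```

Because the weights are normalised, the target density only needs to be known up to a constant, which is what makes the estimator convenient when the target is defined through a fitness or reward function.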

Title: Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

Authors: Michael Hardy, Yunsung Kim
Subjects: cs.LG, cs.AI, cs.CY, stat.AP
Abstract URL: https://arxiv.org/abs/2603.00883
Pdf URL: https://arxiv.org/pdf/2603.00883
Copy Paste: [[2603.00883]] Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact(https://arxiv.org/abs/2603.00883)
Keywords: generative
Abstract: LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models (FMs, i.e., generative pre-trained base LLMs) on out-of-distribution (OOD) tasks in the teaching and learning of schoolchildren. Across all FMs, inter-model behaviors on disparate tasks correlate more strongly with each other than with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with learning outcomes. Further, we find that multi-model ensembles, using either unanimous model voting or expert weighting by benchmark performance, further exacerbate misalignment with learning. We measure that 50% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment on complex tasks and provide insights into both the educational applications of foundation models and the limitations of these models.

Title: Probabilistic Learning and Generation in Deep Sequence Models

Authors: Wenlong Chen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.00888
Pdf URL: https://arxiv.org/pdf/2603.00888
Copy Paste: [[2603.00888]] Probabilistic Learning and Generation in Deep Sequence Models(https://arxiv.org/abs/2603.00888)
Keywords: generation, generative
Abstract: Despite the exceptional predictive performance of deep sequence models (DSMs), the main concern about their deployment centers on their lack of uncertainty awareness. In contrast, probabilistic models quantify the uncertainty associated with unobserved variables using the rules of probability. Notably, Bayesian methods leverage Bayes' rule to express our belief about unobserved variables in a principled way. Since exact Bayesian inference is computationally infeasible at scale, approximate inference is required in practice. Two major bottlenecks of Bayesian methods, especially when applied to deep neural networks, are prior specification and approximation quality. In Chapters 3 and 4, we investigate how the architectures of DSMs themselves can inform the design of priors or approximations in probabilistic models. We first develop an approximate Bayesian inference method tailored to the Transformer, based on the similarity between attention and sparse Gaussian processes. Next, we exploit the long-range memory preservation capability of HiPPOs (High-order Polynomial Projection Operators) to construct an interdomain inducing point for Gaussian processes, which successfully memorizes the history in online learning. In addition to the progress of DSMs in predictive tasks, sequential generative models consisting of a sequence of latent variables have become popular in the domain of deep generative models. Inspired by the explicit self-supervised signals for these latent variables in diffusion models, in Chapter 5 we explore the possibility of improving other generative models with self-supervision for their sequential latent states, and investigate desired probabilistic structures over them. Overall, this thesis leverages inductive biases in DSMs to design probabilistic inference or structure, which bridges the gap between DSMs and probabilistic models, leading to mutually reinforced improvement.

Title: pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

Authors: Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00905
Pdf URL: https://arxiv.org/pdf/2603.00905
Copy Paste: [[2603.00905]] pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning(https://arxiv.org/abs/2603.00905)
Keywords: generation
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.

Title: ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration

Authors: Xiaolong Zeng, Yitong Yu, Shiyao Xiong, Jinhua Hao, Ming Sun, Chao Zhou, Bin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00906
Pdf URL: https://arxiv.org/pdf/2603.00906
Copy Paste: [[2603.00906]] ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration(https://arxiv.org/abs/2603.00906)
Keywords: restoration
Abstract: Look-Up Table (LUT) based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment on edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, a Learnable Spatial Shift (LSS) module is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets to feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8× larger receptive field and improves average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
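
The channel-wise spatial shift described for the LSS module can be pictured with a toy example. The sketch below is only an illustration under stated assumptions: offsets are integers, borders are zero-padded, and `shift_channels` is a hypothetical helper. The abstract does not say how the offsets are parameterized or learned.

```python
def shift_channels(feature, offsets):
    """Shift each channel of a C x H x W nested list by its own (dy, dx).

    Hypothetical helper: integer offsets and zero padding are assumptions,
    not details taken from the paper.
    """
    shifted = []
    for channel, (dy, dx) in zip(feature, offsets):
        h, w = len(channel), len(channel[0])
        out = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
                if 0 <= sy < h and 0 <= sx < w:
                    out[y][x] = channel[sy][sx]
        shifted.append(out)
    return shifted


feature = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[5.0, 6.0], [7.0, 8.0]],  # channel 1
]
# Channel 0 moves one pixel right, channel 1 moves one pixel down.
result = shift_channels(feature, [(0, 1), (1, 0)])
```

After the shift, a fixed pixel position holds values that came from different neighbors in different channels, which is one way a per-pixel lookup could see a wider spatial context.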

Title: VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Authors: Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00912
Pdf URL: https://arxiv.org/pdf/2603.00912
Copy Paste: [[2603.00912]] VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection(https://arxiv.org/abs/2603.00912)
Keywords: generation
Abstract: Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.

Title: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Authors: Seungwook Kim, Minsu Cho
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00918
Pdf URL: https://arxiv.org/pdf/2603.00918
Copy Paste: [[2603.00918]] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards(https://arxiv.org/abs/2603.00918)
Keywords: generation, generative
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
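
The self-confidence signal described above (how accurately the model recovers injected noise) can be sketched as a scalar reward. This is a minimal illustration, not ARC's actual procedure: the function names, the fixed `noise_scale`, the number of probes, and the use of negative mean squared error as the reward are all assumptions.

```python
import random


def self_confidence_reward(sample, predict_noise, noise_scale=0.5, probes=8, seed=0):
    """Average negative squared error between injected and recovered noise.

    `predict_noise(noisy, noise_scale)` stands in for the model's denoiser
    (hypothetical interface). Higher reward means better noise recovery.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(probes):
        noise = [rng.gauss(0.0, 1.0) for _ in sample]
        noisy = [s + noise_scale * n for s, n in zip(sample, noise)]
        predicted = predict_noise(noisy, noise_scale)
        total += sum((p - n) ** 2 for p, n in zip(predicted, noise)) / len(sample)
    return -total / probes


def oracle_denoiser(clean):
    """Toy denoiser that knows the clean sample, so it recovers the noise."""
    def predict(noisy, noise_scale):
        return [(x - c) / noise_scale for x, c in zip(noisy, clean)]
    return predict


def blind_denoiser(noisy, noise_scale):
    """Toy denoiser that always predicts zero noise."""
    return [0.0 for _ in noisy]


sample = [0.2, -0.4, 0.9, 0.1]
confident = self_confidence_reward(sample, oracle_denoiser(sample))
unsure = self_confidence_reward(sample, blind_denoiser)
```

In this toy setup the oracle earns a reward near zero while the blind denoiser earns a clearly negative one, so reinforcing high-reward generations favors samples the model can denoise reliably.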

Title: DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving

Authors: Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang, Zhenhua Xu, Yaqin Zhang, Jianqiang Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.00919
Pdf URL: https://arxiv.org/pdf/2603.00919
Copy Paste: [[2603.00919]] DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving(https://arxiv.org/abs/2603.00919)
Keywords: generation
Abstract: Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.

Title: Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos

Authors: Shreshth Saini, Bowen Chen, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00938
Pdf URL: https://arxiv.org/pdf/2603.00938
Copy Paste: [[2603.00938]] Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos(https://arxiv.org/abs/2603.00938)
Keywords: quality assessment
Abstract: High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
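
One plausible reading of the "Gaussian weighted regression reward" is a reward that peaks when the predicted score equals the mean opinion score (MOS) and decays smoothly with the error. The sketch below assumes that form; the function name and the bandwidth `sigma` are illustrative, and the abstract does not give the exact formula.

```python
import math


def gaussian_reward(predicted_mos, target_mos, sigma=0.5):
    """Reward in (0, 1] that decays with squared error (assumed form)."""
    error = predicted_mos - target_mos
    return math.exp(-(error ** 2) / (2.0 * sigma ** 2))


exact = gaussian_reward(3.0, 3.0)
near = gaussian_reward(3.5, 3.0)
far = gaussian_reward(4.0, 3.0)
```

Unlike a hard correct/incorrect reward, this shape still gives partial credit to near misses, which is what fine-grained calibration needs.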

Title: Mobile-VTON: High-Fidelity On-Device Virtual Try-On

Authors: Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00947
Pdf URL: https://arxiv.org/pdf/2603.00947
Copy Paste: [[2603.00947]] Mobile-VTON: High-Fidelity On-Device Virtual Try-On(https://arxiv.org/abs/2603.00947)
Keywords: generation
Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024×768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.

Title: Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

Authors: Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00975
Pdf URL: https://arxiv.org/pdf/2603.00975
Copy Paste: [[2603.00975]] Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models(https://arxiv.org/abs/2603.00975)
Keywords: generative
Abstract: Unlearning in text-to-image diffusion models often leads to uneven concept removal and unintended forgetting of unrelated capabilities. This complicates tasks such as copyright compliance, protected data mitigation, artist opt-outs, and policy-driven content updates. As models grow larger and adopt more diverse architectures, achieving precise and selective unlearning while preserving generative quality becomes increasingly challenging. We introduce SurgUn (pronounced as Surgeon), a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models. Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones by competing for shared representational pathways. We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept while preserving unrelated capabilities through a novel training paradigm. SurgUn achieves high-precision unlearning across diverse settings. It performs strongly on compact U-Net based models such as Stable Diffusion v1.5, scales effectively to the larger U-Net architecture SDXL, and extends to SANA, representing an underexplored Diffusion Transformer based architecture for unlearning.

Title: PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation

Authors: Jiangshan Wang, Kang Zhao, Jiayi Guo, Jiayu Wang, Hang Guo, Chenyang Zhu, Xiu Li, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00976
Pdf URL: https://arxiv.org/pdf/2603.00976
Copy Paste: [[2603.00976]] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation(https://arxiv.org/abs/2603.00976)
Keywords: generation
Abstract: High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose PreciseCache, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of 2.6x speedup without noticeable quality loss. Source code will be released.
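
The step-skipping test behind LFCache can be illustrated with a small sketch. Everything specific here is an assumption: a moving average stands in for the low-pass filter, the difference is a relative L1 gap, and the threshold value is arbitrary. The abstract defines neither the exact LFD formula nor how current-step features are obtained cheaply, so both feature vectors are simply taken as given.

```python
def low_pass(signal, window=3):
    """Moving-average low-pass filter, shrinking the window at the edges."""
    half = window // 2
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def low_frequency_difference(current, cached, window=3):
    """Relative L1 gap between the low-pass parts of two feature vectors."""
    cur_low = low_pass(current, window)
    old_low = low_pass(cached, window)
    gap = sum(abs(a - b) for a, b in zip(cur_low, old_low))
    scale = sum(abs(b) for b in old_low) + 1e-8  # avoid division by zero
    return gap / scale


def should_skip(current, cached, threshold=0.05):
    """Reuse cached features when the low-frequency content barely moved."""
    return low_frequency_difference(current, cached) < threshold


cached = [1.0, 1.0, 1.0, 1.0]
jitter = [1.1, 0.9, 1.1, 0.9]  # mostly high-frequency change
drift = [1.5, 1.5, 1.5, 1.5]   # low-frequency change

skip_jitter = should_skip(jitter, cached)
skip_drift = should_skip(drift, cached)
```

In this toy case the jittery step is treated as redundant while the drifting step is recomputed, matching the idea that low-frequency change is the signal that matters for redundancy.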

Title: EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization

Authors: Zhaoxin Fan, Nanxiang Jiang, Daiheng Gao, Shiji Zhou, Wenjun Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00978
Pdf URL: https://arxiv.org/pdf/2603.00978
Copy Paste: [[2603.00978]] EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization(https://arxiv.org/abs/2603.00978)
Keywords: generation, generative
Abstract: Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
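
To make the idea of resolving conflicting objectives concrete, the sketch below shows explicit gradient projection in the style of PCGrad. This is a stand-in technique, not the paper's method: the abstract describes an implicit gradient surgery strategy whose details are not given here, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflict(erase_grad, keep_grad):
    """Drop the part of erase_grad that opposes keep_grad.

    If the two gradients do not conflict (non-negative inner product),
    erase_grad is returned unchanged.
    """
    overlap = dot(erase_grad, keep_grad)
    if overlap >= 0.0:
        return list(erase_grad)
    coeff = overlap / dot(keep_grad, keep_grad)
    return [e - coeff * k for e, k in zip(erase_grad, keep_grad)]


# The erasure gradient pushes against the utility gradient along axis 1.
adjusted = project_conflict([1.0, -1.0], [0.0, 1.0])
untouched = project_conflict([1.0, 1.0], [0.0, 1.0])
```

After projection the erasure update no longer has a component that directly opposes the utility gradient, which is the balance between concept removal and preserved generation that the constrained formulation aims for.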

Title: The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers

Authors: Jiaqi Tang, Weixuan Xu, Shu Zhang, Fandong Zhang, Qingchao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00985
Pdf URL: https://arxiv.org/pdf/2603.00985
Copy Paste: [[2603.00985]] The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers(https://arxiv.org/abs/2603.00985)
Keywords: generation
Abstract: Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.

Title: Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual Information

Authors: Xinwen Cheng, Jingyuan Zhang, Zhehao Huang, Yingwen Wu, Xiaolin Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00992
Pdf URL: https://arxiv.org/pdf/2603.00992
Copy Paste: [[2603.00992]] Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual Information(https://arxiv.org/abs/2603.00992)
Keywords: generation, generative
Abstract: The powerful generative capabilities of diffusion models have raised growing privacy and safety concerns regarding the generation of sensitive or undesired content. In response, machine unlearning (MU), commonly referred to as concept erasure (CE) in diffusion models, has been introduced to remove specific knowledge from model parameters while preserving innocent knowledge. Despite recent advancements, existing unlearning methods often suffer from excessive and indiscriminate removal, which leads to substantial degradation in the quality of innocent generations. To preserve model utility, prior works rely on compensation, i.e., re-assimilating a subset of the remaining data or explicitly constraining the divergence from the pre-trained model on remaining concepts. However, we reveal that generations beyond the compensation scope still suffer, suggesting such post-remedial compensations are inherently insufficient for preserving the general utility of large-scale generative models. Therefore, in this paper, we advocate for developing compensation-free concept erasure operations, which precisely identify and eliminate the undesired knowledge such that the impact on other generations is minimal. Technically, we propose MiM-MU, which unlearns a concept by minimizing mutual information, with a careful design for computational effectiveness and for maintaining the sampling distribution of other concepts. Extensive evaluations demonstrate that our proposed method achieves effective concept removal while maintaining high-quality generations for other concepts and, remarkably, does so for the first time without relying on any post-remedial compensation.

Title: Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer

Authors: Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01000
Pdf URL: https://arxiv.org/pdf/2603.01000
Copy Paste: [[2603.01000]] Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer(https://arxiv.org/abs/2603.01000)
Keywords: generation
Abstract: Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
摘要：运动传输已成为可控视频生成的一个有前途的方向，但现有方法主要集中于单个对象场景，并且当多个对象需要不同的运动模式时会遇到困难。在这项工作中，我们提出了 FlexiMMT，这是第一个隐式图像到视频 (I2V) 运动传输框架，它显式地支持多对象、多运动传输。给定静态多对象图像和多个参考视频，FlexiMMT 独立提取运动表示并将其准确地分配给不同的对象，支持灵活的重组和任意运动到对象的映射。为了解决跨对象运动纠缠的核心挑战，我们引入了运动解耦掩模注意力机制，该机制使用特定于对象的掩模来约束注意力，确保运动和文本标记仅影响其指定区域。我们进一步提出了一种差异化掩模传播机制，该机制直接从扩散注意力中导出特定于对象的掩模，并逐步有效地在帧之间传播它们。大量实验表明，FlexiMMT 在基于 I2V 的多目标多运动传输中实现了精确、组合和最先进的性能。
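
Example sketch (not from the paper): the abstract says object-specific masks constrain attention so that motion and text tokens only influence their designated regions. One common way to implement such a constraint is to zero out the attention weight of every disallowed key before normalising. The function name `masked_softmax` and the toy scores below are illustrative assumptions, not FlexiMMT's actual implementation.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over attention scores, with disallowed keys forced to weight 0.

    scores  -- raw attention scores of one query against every key
    allowed -- booleans; True where the key belongs to the query's object
    Assumes at least one key is allowed.
    """
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# One image-region query against three motion tokens; the second token
# belongs to a different object, so it is masked out despite its high score.
weights = masked_softmax([1.0, 5.0, 1.0], [True, False, True])
print(weights)
```

The masked token receives zero weight no matter how large its raw score is, which is the property that prevents cross-object leakage in this toy setting.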

Title: GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

Authors: Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01010
Pdf URL: https://arxiv.org/pdf/2603.01010
Copy Paste: [[2603.01010]] GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis(https://arxiv.org/abs/2603.01010)
Keywords: generation, generative
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
摘要：生成建模的最新进展极大地增强了新颖的视图合成，但保持不同观点的一致性仍然具有挑战性。基于扩散的模型依赖于随机噪声到数据的转换，这会掩盖确定性结构并产生不一致的视图预测。我们提出了一个数据到数据流匹配框架，该框架直接学习成对视图之间的确定性转换，通过显式数据耦合增强视图一致性合成。为了进一步增强几何一致性，我们引入了概率密度测地线流匹配（PDG-FM），它使用从预训练扩散模型的概率密度度量导出的测地线插值来约束流轨迹。这种与数据流形的高密度区域的对齐促进了样本之间更真实的插值。根据经验，我们的方法超越了基于扩散的 NVS 基线，展示了结构一致性的改进和视图之间更平滑的过渡。这些结果凸显了将数据相关的几何正则化合并到确定性流匹配中以实现一致的新颖视图生成的优势。
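
Example sketch (not from the paper): in flow matching between paired samples, a standard baseline uses the straight-line interpolant x_t = (1 - t) x_0 + t x_1, whose target velocity is x_1 - x_0. The code below shows only that straight-line baseline for a source view `x0` and a target view `x1`; the abstract's PDG-FM replaces the straight line with geodesic interpolants, which are not reproduced here. All function names are illustrative assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (source view) to x1 (target view)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

x0, x1 = [0.0, 2.0], [4.0, 2.0]
print(interpolate(x0, x1, 0.25))
print(flow_matching_loss([3.0, 0.0], x0, x1))
```

In training, `predicted` would come from a network evaluated at the interpolated point and time; here it is a fixed list so the arithmetic can be checked by hand.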

Title: One-Token Verification for Reasoning Correctness Estimation

Authors: Zhan Zhuang, Xiequn Wang, Zebin Chen, Feiyang Ye, Ying Wei, Kede Ma, Yu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01025
Pdf URL: https://arxiv.org/pdf/2603.01025
Copy Paste: [[2603.01025]] One-Token Verification for Reasoning Correctness Estimation(https://arxiv.org/abs/2603.01025)
Keywords: generation
Abstract: Recent breakthroughs in large language models (LLMs) have led to notable successes in complex reasoning tasks, such as mathematical problem solving. A common strategy for improving performance is parallel thinking, in which multiple reasoning traces are generated and the final prediction is made using aggregation schemes like majority voting or best-of-$N$ decoding. However, two key challenges persist. First, multi-sample decoding incurs substantial inference latency, especially for long-form outputs. Second, effective mechanisms for reliably assessing the correctness of individual reasoning traces are still limited. To address these challenges, we introduce One-Token Verification (OTV), a computational method that estimates reasoning correctness in a single forward pass during generation. OTV is activated by a learnable token and integrated into the LLM via low-rank adaptation to probe internal reasoning signals through the key-value cache, supporting token-level correctness estimation at any stage of generation without disrupting primary reasoning. Experiments on mathematical reasoning benchmarks demonstrate that OTV consistently surpasses existing verifiers. Additionally, OTV reduces token usage by up to $90\%$ through correctness-guided early termination, prioritizing shorter, more reliable solutions.
摘要：最近大型语言模型 (LLM) 的突破在解决数学问题等复杂推理任务中取得了显着的成功。提高性能的常见策略是并行思维，其中生成多个推理轨迹，并使用多数投票或 best-of-$N$ 解码等聚合方案进行最终预测。然而，仍然存在两个关键挑战。首先，多样本解码会导致大量的推理延迟，特别是对于长格式输出。其次，可靠评估个体推理轨迹正确性的有效机制仍然有限。为了应对这些挑战，我们引入了单令牌验证（OTV），这是一种计算方法，可在生成过程中通过单次前向传递估计推理正确性。 OTV 由可学习令牌激活，并通过低秩适应集成到 LLM 中，以通过键值缓存探测内部推理信号，支持在生成的任何阶段进行令牌级正确性估计，而不会中断主要推理。数学推理基准实验表明，OTV 始终超越现有验证器。此外，OTV 通过以正确性为导向的提前终止，优先考虑更短、更可靠的解决方案，将 token 使用量减少了高达 $90\%$。
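
Example sketch (not from the paper): the abstract mentions two aggregation schemes for parallel thinking, majority voting and best-of-N decoding, plus correctness-guided early termination. The code below illustrates the two aggregation rules and a simple threshold filter. The per-trace scores stand in for a verifier's correctness estimates; the `keep_confident` rule and all values are hypothetical and are not OTV's actual procedure.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the reasoning traces."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Return the answer of the trace with the highest verifier score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]

def keep_confident(answers, scores, threshold):
    """Hypothetical pruning rule: drop traces scored below the threshold."""
    return [a for a, s in zip(answers, scores) if s >= threshold]

answers = ["42", "41", "42", "17"]   # final answers of four traces
scores = [0.61, 0.93, 0.58, 0.12]    # assumed correctness estimates

print(majority_vote(answers))
print(best_of_n(answers, scores))
print(keep_confident(answers, scores, 0.5))
```

The two rules can disagree, as they do here: voting picks the most common answer, while best-of-N trusts the single highest-scored trace.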

Title: Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery

Authors: Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01034
Pdf URL: https://arxiv.org/pdf/2603.01034
Copy Paste: [[2603.01034]] Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery(https://arxiv.org/abs/2603.01034)
Keywords: super-resolution
Abstract: Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at this https URL.
摘要：张量环 (TR) 分解是高阶数据建模的强大工具，但本质上仅限于固定网格上定义的离散形式。在这项工作中，我们提出了网格网格和非网格网格数据的 TR 函数分解，其中因子通过隐式神经表示（INR）进行参数化。然而，优化这个连续框架以捕获精细尺度的细节本质上是困难的。通过频域分析，我们证明 TR 因子的谱结构决定了重构张量的频率组成并限制了高频建模能力。为了缓解这个问题，我们提出了一种重新参数化的 TR 函数分解，其中每个 TR 因子都是可学习的潜在张量和固定基的结构化组合。从理论上讲，这种重新参数化可以改善 TR 因子学习的训练动态。我们进一步推导了固定基础的原则性初始化方案，并证明了我们提出的模型的 Lipschitz 连续性。关于图像修复、去噪、超分辨率和点云恢复的大量实验表明，我们的方法始终优于现有方法。代码可从此 https URL 获取。
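
Example sketch (not from the paper): in the classic discrete Tensor Ring format, each tensor entry is the trace of a product of matrix slices, one slice per core. The code below evaluates a single entry that way. The nested-list layout `cores[k][i]` and the toy cores are assumptions for illustration; the paper's functional variant parameterizes the factors with INRs over continuous coordinates, which is not shown here.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def tr_element(cores, index):
    """Entry of a tensor in Tensor Ring format.

    cores[k][i] is the r_k x r_{k+1} slice of core k at mode index i,
    with the last rank equal to the first so the ring closes.
    """
    product = cores[0][index[0]]
    for core, i in zip(cores[1:], index[1:]):
        product = matmul(product, core[i])
    # Closing the ring corresponds to taking the trace.
    return sum(product[j][j] for j in range(len(product)))

# Two cores of rank 2, each with two slices (a 2 x 2 tensor overall).
core_a = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
core_b = [[[2, 0], [0, 3]], [[1, 1], [1, 1]]]

print(tr_element([core_a, core_b], (0, 0)))
print(tr_element([core_a, core_b], (1, 0)))
```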

Title: From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

Authors: Haoyuan Zhang, Keyao Wang, Guosheng Zhang, Haixiao Yue, Zhiwen Tan, Siran Peng, Tianshuo Zhang, Xiao Tan, Kunbin Chen, Wei He, Jingdong Wang, Ajian Liu, Xiangyu Zhu, Zhen Lei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01038
Pdf URL: https://arxiv.org/pdf/2603.01038
Copy Paste: [[2603.01038]] From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing(https://arxiv.org/abs/2603.01038)
Keywords: generation
Abstract: Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
摘要：人脸识别仍然容易受到演示攻击，因此需要强大的人脸反欺骗 (FAS) 解决方案。最近基于 MLLM 的 FAS 方法将二元分类任务重新表述为生成简短的文本描述，以提高跨域泛化能力。然而，它们的泛化能力仍然有限，因为此类描述主要捕获直观的语义线索（例如，面具轮廓），同时难以感知细粒度的视觉模式。为了解决这一限制，我们将外部视觉工具纳入 MLLM，以鼓励对微妙的欺骗线索进行更深入的调查。具体来说，我们提出了工具增强推理 FAS (TAR-FAS) 框架，它将 FAS 任务重新表述为视觉工具思维链 (CoT-VT) 范式，允许 MLLM 从直观观察开始，并自适应地调用外部视觉工具进行细粒度调查。为此，我们设计了一个工具增强数据注释管道并构建了 ToolFAS-16K 数据集，其中包含多轮工具使用推理轨迹。此外，我们引入了工具感知的 FAS 训练管道，其中多样化工具组相对策略优化 (DT-GRPO) 使模型能够自主学习有效的工具使用。在具有挑战性的一对十一跨域协议下进行的大量实验表明，TAR-FAS 实现了 SOTA 性能，同时为可靠的欺骗检测提供细粒度的视觉调查。

Title: Evaluating GFlowNet from partial episodes for stable and flexible policy-based training

Authors: Puhua Niu, Shili Wu, Xiaoning Qian
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01047
Pdf URL: https://arxiv.org/pdf/2603.01047
Copy Paste: [[2603.01047]] Evaluating GFlowNet from partial episodes for stable and flexible policy-based training(https://arxiv.org/abs/2603.01047)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
摘要：生成流网络（GFlowNets）的开发目的是通过将候选组合的生成过程解释为有向非循环图中的轨迹来学习有效采样组合候选的策略。在基于价值的训练工作流程中，目标是在学习的策略流和期望策略的估计流之间强制实现部分事件的平衡，隐含地鼓励策略分歧最小化。基于策略的策略在估计策略分歧和更新策略之间交替，但有向非循环图下的分歧的可靠估计仍然是一个主要挑战。这项工作通过表明流量平衡还产生了一个衡量差异的原则性政策评估器，并提出了针对部分事件的评估平衡目标来学习评估器，从而弥合了这两种观点。正如综合任务和现实世界任务所证明的那样，评估平衡不仅增强了基于策略的训练的可靠性，而且还通过无缝支持参数化后向策略并实现离线数据收集技术的集成来扩大其灵活性。

Title: MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

Authors: Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01050
Pdf URL: https://arxiv.org/pdf/2603.01050
Copy Paste: [[2603.01050]] MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline(https://arxiv.org/abs/2603.01050)
Keywords: generation
Abstract: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool type and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at this https URL.
摘要：我们的目标是开发一种能够进行显式推理和规划、多工具调用和跨模态信息合成的多模态研究代理，使其能够执行深入的研究任务。然而，我们观察到开发此类代理的三个主要挑战：（1）搜索密集型多模式 QA 数据的稀缺，（2）缺乏有效的搜索轨迹，以及（3）在线搜索 API 的培训成本过高。为了解决这些问题，我们首先提出了 Hyper-Search，这是一种基于超图的 QA 生成方法，可以对模态内部和跨模态的视觉和文本节点进行建模和连接，从而能够生成需要调用各种搜索工具来解决的搜索密集型多模态 QA 对。其次，我们引入了DR-TTS，它首先根据搜索工具类型将搜索涉及的任务分解为几类，并分别为每个工具优化专门的搜索工具专家。然后，它重组工具专家，通过树搜索共同探索搜索轨迹，生成使用各种搜索工具成功解决复杂任务的轨迹。第三，我们构建了一个支持多种搜索工具的离线搜索引擎，无需使用昂贵的在线搜索 API 即可实现代理强化学习。通过这三种设计，我们开发了 MM-DeepResearch，一个强大的多模式深度研究代理，广泛的结果显示了它在基准测试中的优越性。代码可在此 https URL 获取

Title: LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

Authors: Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01068
Pdf URL: https://arxiv.org/pdf/2603.01068
Copy Paste: [[2603.01068]] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model(https://arxiv.org/abs/2603.01068)
Keywords: generation
Abstract: We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at this https URL.
摘要：我们提出了 LLaDA-o，一种有效的长度自适应全向扩散模型，用于多模态理解和生成。 LLaDA-o 建立在扩散混合 (MoD) 框架之上，该框架将用于文本理解的离散掩模扩散和用于视觉生成的连续扩散解耦，同时通过共享、简单且高效的注意力主干将它们耦合起来，从而减少固定条件下的冗余计算。在 MoD 的基础上，我们进一步引入了一种以数据为中心的长度自适应策略，该策略可以在多模态设置中实现灵活的长度解码，而无需更改架构。大量实验表明，LLaDA-o 在多模态理解和生成基准上实现了全向扩散模型中最先进的性能，并且在文本到图像生成的 DPG-Bench 上达到了 87.04，支持了统一全向扩散建模的有效性。代码可从此 https URL 获取。

Title: Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective

Authors: Arctanx An, Shizhao Sun, Danqing Huang, Mingxi Cheng, Yan Gao, Ji Li, Yu Qiao, Jiang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01083
Pdf URL: https://arxiv.org/pdf/2603.01083
Copy Paste: [[2603.01083]] Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective(https://arxiv.org/abs/2603.01083)
Keywords: quality assessment
Abstract: Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design this http URL, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: this https URL
摘要：评估图形设计的美学质量是视觉传达的核心，但在视觉语言模型 (VLM) 中仍未得到充分探索。我们研究 VLM 是否能够以与人类类似的方式评估设计美学。先前的工作面临三个关键限制：基准仅限于狭窄的原则和粗略的评估协议、缺乏系统的 VLM 比较以及用于模型改进的训练数据有限。在这项工作中，我们引入了 AesEval-Bench，这是一个涵盖四个维度、十二个指标和三个完全可量化任务的综合基准：审美判断、区域选择和精确定位。然后，我们系统地评估专有的、开源的和推理增强的 VLM，揭示与美学评估的微妙要求之间的明显性能差距。此外，我们构建了一个训练数据集来微调该领域的 VLM，利用人工引导的 VLM 标签大规模生成任务标签，并利用基于指标的推理将抽象指标与具体设计联系起来。这个 http URL，我们的工作为图形设计中的美学质量评估建立了第一个系统框架。我们的代码和数据集将发布在：this https URL

Title: Understanding LoRA as Knowledge Memory: An Empirical Analysis

Authors: Seungju Back, Dongwoo Lee, Naun Kang, Taehee Lee, S. K. Hong, Youngjune Gwon, Sungjin Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01097
Pdf URL: https://arxiv.org/pdf/2603.01097
Copy Paste: [[2603.01097]] Understanding LoRA as Knowledge Memory: An Empirical Analysis(https://arxiv.org/abs/2603.01097)
Keywords: generation
Abstract: Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although a few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as a complementary axis of memory alongside RAG and ICL, offering distinct advantages.
摘要：预训练大型语言模型 (LLM) 的持续知识更新越来越必要，但仍然具有挑战性。尽管上下文学习 (ICL) 和检索增强生成 (RAG) 等推理时间方法很流行，但它们面临上下文预算、成本和检索碎片方面的限制。与这些依赖于上下文的范例不同，这项工作研究了一种使用低秩适应（LoRA）作为模块化知识记忆的参数化方法。尽管最近很少有著作研究这个概念，但控制其容量和可组合性的基本机制在很大程度上仍未被探索。我们通过第一个系统的实证研究来弥合这一差距，该研究映射了基于 LoRA 的存储器的设计空间，范围从表征存储容量和优化内化到扩展多模块系统和评估长上下文推理。我们不是提出单一架构，而是提供有关 LoRA 内存操作边界的实用指导。总体而言，我们的研究结果将 LoRA 定位为与 RAG 和 ICL 一起的记忆轴的补充，具有明显的优势。
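
Example sketch (not from the paper): Low-Rank Adaptation keeps a pretrained weight matrix W frozen and adds a trainable low-rank update, so a layer computes W x + (alpha / r) B A x with A of shape r x d_in and B of shape d_out x r. The code below is a minimal pure-Python forward pass plus a parameter count. The helper names and toy matrices are illustrative assumptions and say nothing about the paper's experimental setup.

```python
def matvec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha):
    """Frozen base layer plus the scaled low-rank update (alpha / r) * B A x."""
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in A and B, versus d_in * d_out for full W."""
    return r * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
A = [[1.0, 1.0]]               # r x d_in, with r = 1
B = [[1.0], [0.0]]             # d_out x r

print(lora_forward(W, A, B, [2.0, 3.0], alpha=1.0))
print(lora_param_count(4096, 4096, 8), "vs", 4096 * 4096)
```

Because only A and B are trained, each adapter is small relative to the full weight matrix, which is what makes it plausible to treat an adapter as a swappable module.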

Title: Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting

Authors: Dantong Qin, Alessandro Bozzon, Xian Yang, Xun Zhang, Yike Guo, Pan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01103
Pdf URL: https://arxiv.org/pdf/2603.01103
Copy Paste: [[2603.01103]] Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting(https://arxiv.org/abs/2603.01103)
Keywords: generation, generative
Abstract: Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
摘要：许多创意多媒体系统都是建立在笔划或纹理等视觉基元的基础上的，这些视觉基元很难大规模收集，并且与自然图像数据根本不同。这种数据稀缺使得现代生成模型难以学习富有表现力和可控的基元，从而限制了它们在流程感知内容创建中的使用。在这项工作中，我们研究了从一小组手绘样本 (n=470) 中学习类人笔触生成的问题，并提出了 StrokeDiff，一种具有平滑正则化 (SmR) 的基于扩散的框架。 SmR 在训练过程中注入随机视觉先验，提供了一种简单的机制，可以在稀疏监督下稳定扩散模型，而无需改变推理过程。我们进一步展示了如何通过基于贝塞尔曲线的调节模块使学习到的基元变得可控，并集成到完整的基于笔划的绘画管道中，包括预测、生成、排序和合成。这演示了数据高效的原始建模如何支持富有表现力和结构化的多媒体内容创建。实验表明，所提出的方法可以产生多样化且结构连贯的笔触，并使绘画具有更丰富的纹理和层次感，并通过自动指标和人工评估进行验证。

Title: ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models

Authors: Xiwei Liu, Yulong Li, Xinlin Zhuang, Xuhui Li, Jianxu Chen, Haolin Yang, Imran Razzak, Yutong Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01124
Pdf URL: https://arxiv.org/pdf/2603.01124
Copy Paste: [[2603.01124]] ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models(https://arxiv.org/abs/2603.01124)
Keywords: generation
Abstract: Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypothesis-driven region proposals. Multiple Med-LLM evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
摘要：医学视觉语言模型在临床决策支持方面显示出巨大的潜力，但由于局部病理证据的基础不足，它们仍然容易产生事实幻觉。现有的医学对齐方法主要通过偏好优化在响应级别上运行，提高输出的正确性，但使中间推理与视觉区域的联系较弱。尽管思维链（CoT）增强了多模式推理，但它仍然主要以文本为中心，限制了临床视觉线索的有效整合。为了解决这一差距，我们提出了 ClinCoT，这是一种临床感知的视觉思维链框架，可将偏好优化从响应级别校正转变为视觉驱动推理。我们引入了一种自动数据生成管道，该管道通过假设驱动的区域建议进行推理来构建临床基础的偏好对。多个 Med-LLM 评估者对每个响应进行排名并分配分数，这些排名作为训练目标模型的监督。我们进一步引入了一种基于评分的边缘感知优化策略，该策略结合了偏好排名和分数差异来细化区域级推理轨迹。为了在训练过程中模型策略的发展保持一致，我们采用了动态重新生成偏好数据的迭代学习方案。对三个医学 VQA 和报告生成基准的广泛实验表明，与现有的基于偏好的对齐方法相比，ClinCoT 持续改进事实基础并实现卓越的性能。

Title: Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers

Authors: Kuai Jiang, Zhaoyan Ding, Guijuan Zhang, Dianjie Lu, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01140
Pdf URL: https://arxiv.org/pdf/2603.01140
Copy Paste: [[2603.01140]] Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers(https://arxiv.org/abs/2603.01140)
Keywords: generation, generative
Abstract: Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google's reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
摘要：传统的图像去噪模型经常会无意中学习环境因素和噪声模式之间的虚假相关性。此外，由于高频模糊性，他们很难可靠地区分细微纹理和随机噪声，从而导致过度去除细节或残留噪声伪影。因此，我们重新审视通过因果干预的去噪，认为纯粹的相关拟合将内在内容与外在噪声纠缠在一起，这直接降低了分布变化下的鲁棒性。受此启发，我们提出了教师引导的因果解开网络（TCD-Net），它通过 Vision Transformer 框架内对特征空间的结构化干预来明确分解生成机制。具体来说，我们的方法集成了三个关键组件：（1）环境偏差调整（EBA）模块将特征投影到稳定的、去中心的子空间中，以抑制全局环境偏差（去混杂）。 (2)双分支解缠头采用正交约束来强制内容和噪声表示之间的严格分离，防止信息泄漏。 (3) 为了解决结构模糊性，我们利用 Nano Banana Pro（Google 的推理引导 AI 图像生成模型）来引导因果先验，有效地将内容表示拉回到自然图像流形上。大量实验表明，TCD-Net 在保真度和效率方面均优于多个基准测试中的主流方法，在单个 RTX 5090 GPU 上实现了 104.2 FPS 的实时速度。

Title: ArtLLM: Generating Articulated Assets via 3D LLM

Authors: Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01142
Pdf URL: https://arxiv.org/pdf/2603.01142
Copy Paste: [[2603.01142]] ArtLLM: Generating Articulated Assets via 3D LLM(https://arxiv.org/abs/2603.01142)
Keywords: generative
Abstract: Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
摘要：为游戏、机器人和模拟创建交互式数字环境依赖于铰接的 3D 对象，其功能源自其零件几何形状和运动结构。然而，现有方法仍然受到根本限制：基于优化的重建方法需要缓慢的每个对象关节拟合，并且通常仅处理简单的单关节对象，而基于检索的方法从固定库中组装零件，导致重复的几何形状和较差的泛化能力。为了应对这些挑战，我们引入了 ArtLLM，这是一种新颖的框架，用于直接从完整的 3D 网格生成高质量的铰接资产。其核心是一个 3D 多模态大语言模型，该模型在由现有铰接数据集和程序生成的对象整理而成的大规模铰接数据集上进行训练。与之前的工作不同，ArtLLM 自回归预测可变数量的零件和关节，从对象的点云以统一的方式推断它们的运动结构。然后，这种关节感知布局会调节 3D 生成模型来合成高保真零件几何形状。 PartNet-Mobility 数据集上的实验表明，ArtLLM 在零件布局精度和关节预测方面均显着优于最先进的方法，同时能够稳健地推广到现实世界的对象。最后，我们展示了它在构建数字孪生方面的实用性，强调了它在可扩展机器人学习方面的潜力。

Title: Operator Learning Using Weak Supervision from Walk-on-Spheres

Authors: Hrishikesh Viswanath, Hong Chul Nam, Xi Deng, Julius Berner, Anima Anandkumar, Aniket Bera
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01193
Pdf URL: https://arxiv.org/pdf/2603.01193
Copy Paste: [[2603.01193]] Operator Learning Using Weak Supervision from Walk-on-Spheres(https://arxiv.org/abs/2603.01193)
Keywords: generation
Abstract: Training neural PDE solvers is often bottlenecked by expensive data generation or by unstable physics-informed neural networks (PINNs), which involve challenging optimization landscapes due to higher-order derivatives. To tackle this issue, we propose an alternative that uses Monte Carlo methods to estimate the solution to the PDE as a stochastic process for weak supervision during training. Leveraging the walk-on-spheres method, we introduce a learning scheme called Walk-on-Spheres Neural Operator (WoS-NO), which uses weak supervision from WoS to train any given neural operator. We propose to amortize the cost of Monte Carlo walks across the distribution of PDE instances using stochastic representations from the WoS algorithm to generate cheap, noisy estimates of the PDE solution during training. This is formulated into a data-free physics-informed objective where a neural operator is trained to regress against these weak supervisions, allowing the operator to learn a generalized solution map for an entire family of PDEs. This strategy results in a mesh-free framework that operates without expensive pre-computed datasets, avoids the need for computing higher-order derivatives for loss functions that are memory-intensive and unstable, and demonstrates zero-shot generalization to novel PDE parameters and domains. Experiments show that for the same number of training steps, our method exhibits up to 8.75$\times$ improvement in $L_2$-error compared to standard physics-informed training schemes, up to 6.31$\times$ improvement in training speed, and reductions of up to 2.97$\times$ in GPU memory consumption. We present the code at this https URL.
摘要：训练神经 PDE 求解器通常会受到昂贵的数据生成或不稳定的物理信息神经网络 (PINN) 的瓶颈，其中涉及由于高阶导数而带来的具有挑战性的优化环境。为了解决这个问题，我们提出了一种替代方法，使用蒙特卡罗方法来估计偏微分方程的解，作为训练期间弱监督的随机过程。利用 walk-on-spheres 方法，我们引入了一种名为 Walk-on-Spheres Neural Operator (WoS-NO) 的学习方案，它使用 WoS 的弱监督来训练任何给定的神经算子。我们建议使用 WoS 算法的随机表示来摊销蒙特卡洛遍历 PDE 实例分布的成本，以在训练期间生成 PDE 解的廉价、有噪声的估计。这被制定为一个无数据的物理信息目标，其中神经算子经过训练以针对这些弱监督进行回归，从而使算子能够学习整个偏微分方程系列的广义解映射。这种策略产生了一个无网格框架，该框架无需昂贵的预计算数据集即可运行，避免了计算内存密集型且不稳定的损失函数的高阶导数的需要，并演示了对新的偏微分方程参数和域的零样本泛化。实验表明，对于相同数量的训练步骤，与标准物理训练方案相比，我们的方法在 $L_2$ 误差方面表现出高达 8.75$\times$ 的改进，训练速度高达 6.31$\times$ 的改进，以及 GPU 内存消耗高达 2.97$\times$ 的减少。我们在此 https URL 处提供代码。
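
Example sketch (not from the paper): the walk-on-spheres method estimates the solution of a Laplace problem at a point by repeatedly jumping to a uniformly random point on the largest circle that fits inside the domain, stopping near the boundary, and averaging the boundary values reached. The code below does this for the unit disk with Dirichlet data g(x, y) = x, whose exact harmonic solution is u(x, y) = x. The domain, boundary function, tolerance, and walk count are all assumptions chosen for illustration; WoS-NO's training loop, which regresses a neural operator against such noisy estimates, is not shown.

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, rng, eps=1e-3, max_steps=1000):
    """One random walk from (x, y) inside the unit disk to its boundary."""
    for _ in range(max_steps):
        dist = 1.0 - math.hypot(x, y)  # distance to the unit circle
        if dist < eps:
            break
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += dist * math.cos(theta)    # jump to the largest inscribed circle
        y += dist * math.sin(theta)
    norm = math.hypot(x, y)
    # Project onto the boundary and read off the Dirichlet value there.
    return boundary_value(x / norm, y / norm)

def estimate_solution(x, y, boundary_value, n_walks=2000, seed=0):
    """Monte Carlo average over independent walks: a cheap, noisy estimate."""
    rng = random.Random(seed)
    total = sum(walk_on_spheres(x, y, boundary_value, rng) for _ in range(n_walks))
    return total / n_walks

# Boundary data g(x, y) = x, so the exact solution at (0.3, 0.2) is 0.3.
estimate = estimate_solution(0.3, 0.2, lambda bx, by: bx)
print(estimate)
```

No mesh and no derivatives of the solution are needed, which matches the abstract's motivation; the price is the Monte Carlo noise in each estimate.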

Title: RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations

Authors: Mochu Xiang, Zhelun Shen, Xuesong Li, Jiahui Ren, Jing Zhang, Chen Zhao, Shanshan Liu, Haocheng Feng, Jingdong Wang, Yuchao Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01194
Pdf URL: https://arxiv.org/pdf/2603.01194
Copy Paste: [[2603.01194]] RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations(https://arxiv.org/abs/2603.01194)
Keywords: generation
Abstract: Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: this https URL
摘要：人类通过有限视角的 2D 观察来感知 3D 世界。虽然最近的前馈广义 3D 重建模型擅长从稀疏图像中恢复 3D 结构，但它们的表示通常仅限于观察到的区域，从而无法对看不见的几何图形进行建模。这就提出了一个关键的、根本性的挑战：我们能否从部分 2D 观察中推断出完整的 3D 结构？我们提出了 RnG（重建和生成），这是一种新颖的前馈 Transformer，它通过预测隐式的完整 3D 表示来统一这两个任务。在 RnG 的核心，我们提出了一种重建引导的因果注意机制，该机制在注意级别将重建和生成分开，并将 KV 缓存视为隐式 3D 表示。然后，任意姿势都可以有效地查询该缓存，以渲染高保真、新颖视图的 RGBD 输出。结果，RnG 不仅准确地重建了可见的几何形状，而且还生成了合理的、连贯的不可见的几何形状和外观。我们的方法在通用 3D 重建和新颖视图生成方面都实现了最先进的性能，同时对于实时交互式应用程序来说运行效率足够高。项目页面：此 https URL

Title: JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks

Authors: Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.01291
Pdf URL: https://arxiv.org/pdf/2603.01291
Copy Paste: [[2603.01291]] JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks(https://arxiv.org/abs/2603.01291)
Keywords: generation
Abstract: Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at this https URL.
摘要：假新闻破坏政治、经济、健康和国际关系方面的社会信任和决策，在极端情况下威胁人类生命和社会安全。由于假新闻反映了特定地区的政治、社会和文化背景，并以语言表达，因此评估大语言模型（LLM）的风险需要多语言和地区视角。恶意用户可以通过越狱攻击绕过保护措施，诱导法学硕士生成假新闻。然而，目前还没有基准可以系统地评估跨语言和跨地区的攻击弹性。在这里，我们提出了 JailNewsBench，这是评估 LLM 对越狱引起的假新闻生成的鲁棒性的第一个基准。 JailNewsBench 跨越 34 个地区和 22 种语言，涵盖通过 LLM-as-a-Judge 的 8 个评估子指标和 5 个越狱攻击，约有 30 万个实例。我们对 9 个 LLM 的评估显示，最大攻击成功率 (ASR) 达到 86.3%，最大危害得分为 3.5 分（满分 5 分）。值得注意的是，对于英语和美国相关主题，典型多语言 LLM 的防御性能明显低于其他地区，凸显了跨语言和地区安全性的严重不平衡。此外，我们的分析表明，现有安全数据集中假新闻的覆盖范围有限，而且不像毒性和社会偏见等主要类别那样得到很好的保护。我们的数据集和代码可在此 https URL 获取。

Title: You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image

Authors: Taoyue Wang, Xiang Zhang, Xiaotian Li, Huiyuan Yang, Lijun Yin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01328
Pdf URL: https://arxiv.org/pdf/2603.01328
Copy Paste: [[2603.01328]] You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image(https://arxiv.org/abs/2603.01328)
Keywords: generative
Abstract: We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
摘要：我们提出了一种新颖的单阶段方法 NVB-Face，用于直接从单个盲人脸部图像生成一致的新颖视图图像。现有的对象或面部新颖视图合成方法通常需要高分辨率 RGB 图像作为输入。在处理退化图像时，传统的流程遵循两个阶段的过程：首先将图像恢复为高分辨率，然后从恢复的结果合成新的视图。然而，这种方法高度依赖于恢复图像的质量，通常会导致最终输出的不准确和不一致。为了解决这个限制，我们直接从盲人脸部图像中提取单视图特征，并引入一个特征操纵器，将这些特征转换为 3D 感知的多视图潜在表示。利用扩散模型强大的生成能力，我们的框架合成了高质量、一致的新颖视角人脸图像。实验结果表明，我们的方法在一致性和保真度方面显着优于传统的两阶段方法。

Title: DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking

Authors: Gilad Turok, Chris De Sa, Volodymyr Kuleshov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01367
Pdf URL: https://arxiv.org/pdf/2603.01367
Copy Paste: [[2603.01367]] DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking(https://arxiv.org/abs/2603.01367)
Keywords: generative
Abstract: Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution, while generative perplexity requires a biased external model and ignores diversity. To address this, we introduce the DUEL framework, which formalizes deterministic position selection, unifying leading MDM sampling strategies. We prove DUEL admits exact likelihood computation via a simple algorithm, evaluated under the same position selection used at test time. This gives MDMs proper perplexity for the first time -- the natural analogue of autoregressive perplexity. With proper perplexity in hand, we revisit key questions about MDMs. MDMs are substantially better than previously thought: the MDM-autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL enables the first principled comparison of fast, parallel samplers across compute budgets -- an analysis impossible with the ELBO and unreliable with generative perplexity -- identifying probability margin \citep{kim2025train} as a strong default. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models -- achieving 36.47 vs. 52.11 perplexity on AG News -- demonstrating the ceiling of MDM performance has not yet been reached.
摘要：屏蔽扩散模型 (MDM) 通过迭代选择要取消屏蔽的位置，然后预测这些位置的标记来生成文本。然而，MDM 缺乏适当的困惑度评估：ELBO 是训练分布（而不是测试时间分布）下可能性的松散界限，而生成困惑度需要有偏差的外部模型并忽略多样性。为了解决这个问题，我们引入了 \textsc{DUEL} 框架，该框架将 \emph{确定性} 位置选择形式化，统一了领先的 MDM 采样策略。我们通过一个简单的算法证明 \textbf{\textsc{DUEL} 承认 \emph{exact} 似然计算}，并在测试时使用的相同位置选择下进行评估。这 \textbf{第一次给 MDM 带来了适当的困惑}——自回归困惑的自然模拟。带着适当的困惑，我们重新审视有关 MDM 的关键问题。 \textbf{MDM 比之前想象的要好得多}：MDM 自回归困惑度差距在域内数据上缩小了 32%，在零样本基准上缩小了 82%。 \textsc{DUEL} 实现了跨计算预算的快速、并行采样器的第一个原则性比较 - ELBO 不可能进行分析，并且生成性困惑不可靠 - 将概率裕度 \citep{kim2025train} 识别为强默认值。最后，对头寸排序的预言机搜索显示，MDM 可以远远超过自回归模型——在 AG News 上达到 36.47 vs.\ 52.11 的困惑度——这表明 MDM 性能的上限尚未达到。
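
One way to read the claim that deterministic position selection admits exact likelihood: if the next position to unmask is a deterministic function of the current partial sequence, each full sequence is reached by exactly one unmasking path, so its probability is the product of the token probabilities along that path. The sketch below illustrates this reading on a toy problem; it is not the paper's algorithm. The denoiser `toy_model`, the most-confident-first rule in `select_position`, and the binary vocabulary are all hypothetical.

```python
import itertools
import math

MASK = None


def toy_model(state):
    """Hypothetical denoiser: for each masked position, return [p(token=0), p(token=1)]."""
    revealed_sum = sum(t for t in state if t is not MASK)
    out = {}
    for i, t in enumerate(state):
        if t is MASK:
            p1 = (1 + i + revealed_sum) / (4 + i + len(state))
            out[i] = [1.0 - p1, p1]
    return out


def select_position(probs):
    # Deterministic rule (an assumption): unmask the most confident position,
    # breaking ties by the lowest index.
    return max(sorted(probs), key=lambda i: max(probs[i]))


def log_likelihood(tokens):
    """Sum of log-probabilities along the single unmasking path the rule induces."""
    state = [MASK] * len(tokens)
    total = 0.0
    while any(t is MASK for t in state):
        probs = toy_model(state)
        i = select_position(probs)
        total += math.log(probs[i][tokens[i]])
        state[i] = tokens[i]
    return total


# Sanity check: the per-sequence probabilities should sum to one over all sequences.
total_mass = sum(
    math.exp(log_likelihood(list(seq))) for seq in itertools.product([0, 1], repeat=3)
)
```

In this toy setting `total_mass` comes out as one up to floating-point error. The quantity is therefore a proper likelihood, not a bound, and a perplexity can be computed from it directly.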

Title: TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity

Authors: Xiao Cai, Lianli Gao, Pengpeng Zeng, Ji Zhang, Heng Tao Shen, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01371
Pdf URL: https://arxiv.org/pdf/2603.01371
Copy Paste: [[2603.01371]] TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity(https://arxiv.org/abs/2603.01371)
Keywords: generation
Abstract: Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.
摘要：图像到 3D 多实例生成中的精确空间保真度对于下游实际应用至关重要。最近的工作试图通过在多实例数据集上微调预训练的 Image-to-3D (I23D) 模型来解决这个问题，但这会产生大量的训练开销，并且很难保证空间保真度。事实上，我们观察到预训练的 I23D 模型已经拥有有意义的空间先验，但实例纠缠问题证明这些先验仍未得到充分利用。受此启发，我们提出了 TIMI，这是一种新颖的免训练框架，用于图像到 3D 多实例生成，可实现高空间保真度。具体来说，我们首先引入实例感知分离指导（ISG）模块，该模块有助于早期去噪阶段的实例解开。接下来，为了稳定 ISG 引入的指导，我们设计了一个空间稳定几何自适应更新（SGU）模块，该模块可以促进保存实例的几何特征，同时保持它们的相对关系。大量的实验表明，与现有的多实例方法相比，我们的方法在全局布局和不同的本地实例方面产生了更好的性能，并且不需要额外的训练并且具有更快的推理速度。

Title: Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis

Authors: Junwei Zeng, Dong Liang, Sheng-Jun Huang, Kun Zhan, Songcan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01398
Pdf URL: https://arxiv.org/pdf/2603.01398
Copy Paste: [[2603.01398]] Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis(https://arxiv.org/abs/2603.01398)
Keywords: restoration
Abstract: Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: this http URL.
摘要：大气湍流会引入几何扭曲和与曝光时间相关的模糊，从而显着降低远距离成像质量，从而对视觉质量和高级视觉任务的性能产生不利影响。现有的合成湍流效果的方法通常过于简化模糊和曝光时间之间的关系，通常假设固定或二元曝光设置。这导致合成数据不切实际，训练模型的泛化能力有限。为了解决这一差距，我们重新审视了调制传递函数 (MTF) 公式，并提出了一种新颖的曝光时间相关 MTF (ET-MTF)，它将模糊建模为曝光时间的连续函数。对于模糊合成，我们从 ET-MTF 导出倾斜不变点扩散函数 (PSF)，当与空间变化的模糊宽度场集成时，它可以对湍流引起的模糊提供全面且物理上准确的表征。在此合成管道的基础上，我们构建了 ET-Turb，这是一个大规模合成湍流数据集，它明确地结合了跨不同光学和大气条件的连续曝光时间建模。该数据集包含 5,083 个视频（2,005,835 帧），分为 3,988 个训练视频和 1,095 个测试视频。大量实验表明，与在其他数据集上训练的模型相比，在 ET-Turb 上训练的模型可以产生更真实的恢复，并在现实世界的湍流数据上实现出色的泛化。该数据集可在以下网址公开获取：此 http URL。

Title: UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Authors: Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.01418
Pdf URL: https://arxiv.org/pdf/2603.01418
Copy Paste: [[2603.01418]] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation(https://arxiv.org/abs/2603.01418)
Keywords: generation
Abstract: While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
摘要：虽然 Veo3 和 Sora2 等最先进的音视频生成模型展示了卓越的功能，但它们的闭源性质使其架构和训练范例难以访问。为了弥补可访问性和性能方面的差距，我们引入了 UniTalking，这是一个统一的端到端扩散框架，用于生成高保真语音和口型同步视频。其核心是，我们的框架采用多模态转换器块，通过共享的自注意力机制对音频和视频潜在标记之间的细粒度时间对应关系进行显式建模。通过利用预先训练的视频生成模型的强大先验，我们的框架确保了最先进的视觉保真度，同时实现高效的训练。此外，UniTalking 还集成了个性化语音克隆功能，允许根据简短的音频参考生成目标风格的语音。定性和定量结果表明，我们的方法可以生成高度逼真的谈话肖像，在口型同步准确性、音频自然度和整体感知质量方面比现有开源方法具有更优越的性能。

Title: DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis

Authors: Zengqi Zhao, Weidi Xia, Peter Wei, Yan Zhang, Yiyi Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Simiao Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01433
Pdf URL: https://arxiv.org/pdf/2603.01433
Copy Paste: [[2603.01433]] DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis(https://arxiv.org/abs/2603.01433)
Keywords: generative
Abstract: We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
摘要：我们提出了 DOCFORGE-BENCH，这是第一个用于文档伪造检测的统一零样本基准，评估了 8 个数据集的 14 种方法，涵盖文本篡改、收据伪造和身份证件操作。与 ForensicHub [Du et al., 2025] 等面向微调的评估不同，DOCFORGE-BENCH 应用所有方法及其已发布的预训练权重，并且没有域适应——这是一种深思熟虑的设计选择，反映了从业者缺乏标记文档训练数据的实际部署场景。我们的主要发现是在单阈值协议下看不见的普遍校准失败：方法实现中等 Pixel-AUC (>=0.76) 但接近零 Pixel-F1。这种 AUC-F1 差距不是判别失败，而是分数分布变化：被篡改的区域仅占文档图像中像素的 0.27-4.17%——比自然图像基准低一个数量级——使得标准 tau=0.5 阈值发生灾难性的错误校准。 Oracle-F1 比固定阈值 Pixel-F1 高 2-10 倍，这证实了校准而不是表示才是瓶颈。受控校准实验验证了这一点：在 N=10 域图像上调整单个阈值可以恢复 Oracle-F1 差距的 39-55%，这表明阈值调整（而不是重新训练）是实际部署中缺少的关键步骤。总体而言，没有一种评估方法能够在不同的文档类型上可靠地开箱即用，这凸显出文档伪造检测仍然是一个未解决的问题。我们进一步注意到，所有八个数据集都早于生成式人工智能编辑时代；涵盖基于扩散和 LLM 的文档伪造的基准代表了现代攻击面的一个关键的开放缺口。
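
The AUC-F1 gap described in the abstract can be reproduced on synthetic data. When tampered pixels are rare and every score sits below 0.5, a detector can separate the two classes perfectly yet yield zero F1 at the fixed tau=0.5 threshold. The sketch below uses made-up scores and labels, not benchmark data; `pixel_f1` and `oracle_f1` are illustrative helper names.

```python
def pixel_f1(scores, labels, tau):
    """F1 of the binary mask obtained by thresholding per-pixel scores at tau."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def oracle_f1(scores, labels):
    """Best F1 over all thresholds that coincide with an observed score."""
    return max(pixel_f1(scores, labels, tau) for tau in set(scores))


# Synthetic example (made-up numbers): 2% tampered pixels, well separated
# from authentic ones but with every score below 0.5.
labels = [1] * 20 + [0] * 980
scores = [0.30] * 20 + [0.05] * 980

fixed = pixel_f1(scores, labels, tau=0.5)
oracle = oracle_f1(scores, labels)
```

Here the fixed threshold flags no pixels at all, while the oracle threshold recovers the tampered region exactly. This is the sense in which calibration, not representation, limits the score.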

Title: Unifying Language-Action Understanding and Generation for Autonomous Driving

Authors: Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.01441
Pdf URL: https://arxiv.org/pdf/2603.01441
Copy Paste: [[2603.01441]] Unifying Language-Action Understanding and Generation for Autonomous Driving(https://arxiv.org/abs/2603.01441)
Keywords: generation
Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
摘要：视觉-语言-动作（VLA）模型正在成为端到端自动驾驶的一种有前景的范例，因其利用世界知识和推理复杂驾驶场景的潜力而受到重视。然而，现有方法存在两个关键限制：语言指令和动作输出之间持续存在不一致，以及典型自回归动作生成固有的低效率。在本文中，我们介绍了 LinkVLA，这是一种新颖的架构，可以直接解决这些挑战，从而增强对齐和效率。首先，我们通过将语言和动作标记统一到共享的离散码本中来建立结构链接，并在单个多模态模型中进行处理。这从结构上从头开始强制执行跨模式一致性。其次，为了创建深层语义链接，我们引入了辅助动作理解目标，该目标训练模型从轨迹生成描述性标题，从而促进双向语言-动作映射。最后，我们用两步从粗到细的生成方法 C2F 代替了缓慢的逐步生成，该方法可以有效地解码动作序列，节省 86% 的推理时间。闭环驾驶基准实验表明，指令跟踪准确性和驾驶性能持续提高，同时推理延迟也减少。

Title: Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

Authors: Thomas Rückstieß, Robin Vujanic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01444
Pdf URL: https://arxiv.org/pdf/2603.01444
Copy Paste: [[2603.01444]] Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data(https://arxiv.org/abs/2603.01444)
Keywords: generation
Abstract: Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems - from document databases and REST APIs to data lakes - which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening nested data into wide, sparse tables, which scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable-length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
摘要：合成数据生成是数据共享、隐私合规、系统基准测试和测试数据配置的关键功能。现有方法假设密集、固定模式的表格数据，但这种假设与现代数据系统（从文档数据库、REST API 到数据湖）越来越不一致，现代数据系统以稀疏、半结构化格式（如 JSON）存储和交换数据。将现有的表格方法应用于此类数据需要将嵌套数据展平为扩展性差的宽而稀疏的表。我们提出了 Origami，一种基于自回归变压器的架构，它将数据记录（包括嵌套对象和可变长度数组）标记为键、值和结构标记的序列。这种表示本身可以处理稀疏性、混合类型和层次结构，而无需展平或插补。 Origami 在几乎所有设置中的保真度、效用和检测指标方面均优于涵盖 GAN、VAE、扩散和自回归架构的基线，同时保持较高的隐私分数。在稀疏度高达 38% 的半结构化数据集上，基线合成器要么无法扩展，要么大幅退化，而 Origami 则保持高保真度合成，很难与真实数据区分开来。据我们所知，Origami 是第一个能够本地建模并生成端到端半结构化数据的架构。
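
The idea of tokenizing a nested record into key, value and structural tokens can be sketched as a recursive walk over a JSON-like object. This is an illustrative guess at what such a serialization might look like, not Origami's actual tokenizer; the token kinds and the marker names (`OBJ_START`, `ARR_END`, and so on) are hypothetical.

```python
def tokenize(value):
    """Flatten a JSON-like value into a list of (kind, payload) tokens."""
    if isinstance(value, dict):
        tokens = [("struct", "OBJ_START")]
        for key, child in value.items():
            tokens.append(("key", key))
            tokens.extend(tokenize(child))
        tokens.append(("struct", "OBJ_END"))
        return tokens
    if isinstance(value, list):
        tokens = [("struct", "ARR_START")]
        for child in value:
            tokens.extend(tokenize(child))
        tokens.append(("struct", "ARR_END"))
        return tokens
    # Scalars (strings, numbers, booleans, None) become a single value token.
    return [("value", value)]


record = {"name": "Ada", "tags": ["x", "y"], "address": {"zip": 12345}}
tokens = tokenize(record)
```

A field that is absent from a record simply contributes no tokens, so sparsity needs no placeholder columns or imputed values. Arrays of different lengths likewise produce sequences of different lengths.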

Title: Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection

Authors: Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow, Kwok-Yan Lam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01450
Pdf URL: https://arxiv.org/pdf/2603.01450
Copy Paste: [[2603.01450]] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection(https://arxiv.org/abs/2603.01450)
Keywords: generation
Abstract: The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. While existing detection methods demonstrate limitations in generalizing to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model's ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations of frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, particularly achieving state-of-the-art performance in the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at this https URL
摘要：Deepfake 生成技术的快速发展对公共安全构成了重大威胁，并通过创建高度逼真的合成面部媒体造成社会危害。虽然现有的检测方法在推广新兴伪造模式方面存在局限性，但本文提出了 Deepfake Forensics Adapter (DFA)，这是一种新颖的双流框架，可将视觉语言基础模型与目标取证分析相结合。我们的方法将预先训练的 CLIP 模型与三个核心组件集成在一起，通过利用 CLIP 强大的通用功能而不更改 CLIP 参数来实现专门的深度伪造检测：1）全局特征适配器用于识别图像内容中可能表明伪造的全局不一致，2）局部异常流通过显式利用面部结构先验来增强模型感知局部面部伪造线索的能力，3）交互式融合分类器促进全局和之间的深度交互和融合。使用变压器编码器的局部特征。对帧级和视频级基准的广泛评估证明了 DFA 卓越的泛化能力，特别是在具有挑战性的 DFDC 数据集中实现了最先进的性能，帧级 AUC/EER 为 0.816/0.256，视频级 AUC/EER 为 0.836/0.251，视频 AUC 比以前的方法提高了 4.8%。我们的框架不仅展示了最先进的性能，而且还为开发强大的深度伪造检测系统指明了可行且有效的方向，该系统具有增强的泛化能力，以应对不断发展的深度伪造威胁。我们的代码可在此 https URL 获取

Title: Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling

Authors: Zillur Rahman, Alex Sheng, Cristian Meo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01509
Pdf URL: https://arxiv.org/pdf/2603.01509
Copy Paste: [[2603.01509]] Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling(https://arxiv.org/abs/2603.01509)
Keywords: generation, generative
Abstract: While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R utilizes the power of current state-of-the-art T2V diffusion models and vision-language models. It can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual content. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
摘要：虽然大规模数据集推动了文本转视频（T2V）生成模型的重大进展，但这些模型对输入提示仍然高度敏感，这表明提示设计对于生成质量至关重要。当前改进视频输出的方法通常存在不足：它们要么依赖于复杂的后期编辑模型，存在引入伪影的风险，要么需要对核心生成器进行昂贵的微调，这严重限制了可扩展性和可访问性。在这项工作中，我们介绍了 3R，一种新颖的基于 RAG 的提示优化框架。 3R 利用当前最先进的 T2V 扩散模型和视觉语言模型的力量。它可以与任何 T2V 模型一起使用，无需任何类型的模型训练。该框架利用三个关键策略：基于 RAG 的修饰符提取以丰富上下文基础，基于扩散的偏好优化以将输出与人类偏好对齐，以及时间帧插值以生成时间一致的视觉内容。这些组件共同实现了更准确、更高效且与上下文相符的文本到视频的生成。实验结果证明了 3R 在增强生成视频的静态保真度和动态连贯性方面的功效，强调了优化用户提示的重要性。

Title: FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Authors: Hanxiao Wang, Yuan-Chen Guo, Ying-Tian Liu, Zi-Xin Zou, Biao Zhang, Weize Quan, Ding Liang, Yan-Pei Cao, Dong-Ming Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01515
Pdf URL: https://arxiv.org/pdf/2603.01515
Copy Paste: [[2603.01515]] FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation(https://arxiv.org/abs/2603.01515)
Keywords: generation
Abstract: Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
摘要：用于 3D 网格生成的自回归模型存在一个基本限制：它们将网格展平为长顶点坐标序列。这导致计算成本过高，阻碍了高保真几何的有效合成。我们认为这个瓶颈源于错误的语义级别的操作。我们引入了 FACE，这是一种新颖的自回归自动编码器 (ARAE) 框架，它通过在面部级别生成网格来重新概念化任务。我们的一面一令牌策略将每个三角形面（网格的基本构建块）视为一个统一的令牌。这种简单而强大的设计将序列长度减少了九倍，从而实现了前所未有的 0.11 压缩比，将之前最先进的技术减半。这种显着的效率提升并不会影响质量；通过将我们的面部级解码器与强大的 VecSet 编码器配对，FACE 在标准基准上实现了最先进的重建质量。通过训练实现高保真、单图像到网格生成的潜在扩散模型，进一步证明了所学习的潜在空间的多功能性。 FACE 提供了一个简单、可扩展且功能强大的范例，降低了创建高质量结构化 3D 内容的障碍。
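
The factor-of-nine reduction and the 0.11 compression ratio quoted in the abstract are consistent with simple counting, assuming a coordinate-level baseline that spends one token per vertex coordinate (three vertices per triangle, three coordinates per vertex). That baseline layout is an assumption made here for illustration; the abstract does not spell it out.

```python
def sequence_lengths(num_faces, vertices_per_face=3, coords_per_vertex=3):
    """Token counts for a triangle mesh under the two serializations."""
    # Assumed baseline: one token per vertex coordinate.
    coordinate_level = num_faces * vertices_per_face * coords_per_vertex
    # One-face-one-token strategy: a single token per triangle.
    face_level = num_faces
    return coordinate_level, face_level


coord_len, face_len = sequence_lengths(1000)
ratio = face_len / coord_len
```

For a 1,000-face mesh this gives 9,000 tokens versus 1,000, a ratio of 1/9, which rounds to 0.11.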

Title: Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing

Authors: Zijin Yin, Bing Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01535
Pdf URL: https://arxiv.org/pdf/2603.01535
Copy Paste: [[2603.01535]] Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing(https://arxiv.org/abs/2603.01535)
Keywords: generation, generative
Abstract: Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behavior in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline, Gen4Seg, to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
摘要：语义分割在自动驾驶和医学图像分析等各种应用中发挥着关键作用。在实践中部署分割模型时，提前测试其在各种复杂场景中的行为至关重要。在本文中，我们构建了一个自动数据生成管道 Gen4Seg，通过生成具有不同属性变化的各种具有挑战性的样本来对语义分割模型进行压力测试。除了以前仅关注全球天气和风格转移的评估范式之外，我们还研究了对象和图像级别的外观和几何属性的变化。其中包括对象颜色、材质、大小、位置以及图像级别的变化，例如天气和风格。为了实现这一目标，我们建议在扩散模型的支持下，通过精确控制结构信息来编辑现有真实图像的视觉属性。这样，编辑后的图像就可以重复使用现有的分割标签，大大降低了人力成本。使用我们的管道，我们构建了两个新的基准：Pascal-EA 和 COCO-EA。我们对各种语义分割模型进行了基准测试，从封闭集模型到开放词汇大型模型。我们有几个关键发现：1）与几何变化下的封闭集方法相比，先进的开放词汇模型并没有表现出更大的鲁棒性； 2）数据增强技术，例如CutOut和CutMix，在增强针对外观变化的鲁棒性方面受到限制； 3）我们的管道还可以用作数据增强工具，并提高分布内和分布外的性能。我们的工作表明生成模型作为自动分析分割模型的有效工具的潜力，我们希望我们的研究结果将帮助从业者和研究人员开发更强大和可靠的分割模型。

Title: RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry

Authors: Xinchang Wang, Yunhao Chen, Yuechen Zhang, Congcong Bian, Zihao Guo, Xingjun Ma, Hui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01544
Pdf URL: https://arxiv.org/pdf/2603.01544
Copy Paste: [[2603.01544]] RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry(https://arxiv.org/abs/2603.01544)
Keywords: generative
Abstract: Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector. The source code is publicly available on GitHub.
摘要：最近的图像生成器生成的逼真内容会破坏下游识别系统的可靠性。随着视觉外观线索变得不那么明显，依赖于取证线索或高级表示的外观驱动检测器会失去稳定性。这促使人们从外观转向行为，重点关注图像如何响应受控扰动，而不是它们的外观。在这项工作中，我们确定了一个简单且普遍的行为信号。自然图像在小的结构化扰动下保持稳定的语义表示，而生成的图像表现出明显更大的特征漂移。我们将这种现象称为鲁棒性不对称，并提供了理论分析，建立了将这种不对称与生成模型中的记忆倾向联系起来的下界，解释了它在架构中的普遍性。基于这一见解，我们引入了鲁棒性不对称检测（RA-Det），这是一种行为驱动的检测框架，可将鲁棒性不对称转换为可靠的决策信号。通过对 14 种不同的生成模型和 10 多个强大的检测器进行评估，RA-Det 实现了卓越的性能，将平均性能提高了 7.81%。该方法与数据和模型无关，不需要生成器指纹，并且可以在看不见的生成器之间进行传输。总之，这些结果表明，鲁棒性不对称性是合成图像检测的稳定、通用线索，而精心设计的探测可以将这种线索转变为实用的通用检测器。源代码可在 Github 上公开获取。
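
The feature-drift signal described in the RA-Det abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: `toy_extract` is a hypothetical stand-in for a pretrained encoder, Gaussian noise stands in for the "small, structured perturbations", drift is measured as one minus cosine similarity, and the threshold is arbitrary.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_drift(extract, image, noise_scale=0.01, trials=8, seed=0):
    """Mean (1 - cosine similarity) between the features of an image
    and the features of randomly perturbed copies of it."""
    rng = random.Random(seed)
    base = extract(image)
    total = 0.0
    for _ in range(trials):
        # Assumption: Gaussian noise as a stand-in for a structured perturbation.
        perturbed = [x + rng.gauss(0.0, noise_scale) for x in image]
        total += 1.0 - cosine_similarity(base, extract(perturbed))
    return total / trials


def toy_extract(image):
    # Hypothetical feature extractor: mean and mean of squares of the pixels.
    n = len(image)
    return [sum(image) / n, sum(x * x for x in image) / n]


image = [0.2, 0.5, 0.8, 0.4]           # a flattened toy "image"
drift = feature_drift(toy_extract, image, noise_scale=0.05)
THRESHOLD = 1e-3                        # arbitrary, for illustration only
flag_as_generated = drift > THRESHOLD   # larger drift would suggest a generated image
```

In a real detector the extractor would be a frozen vision encoder and the threshold (or a learned classifier on the drift) would be calibrated on held-out data.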

Title: Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder

Authors: Ayantika Das, Keerthi Ram, Mohanasankar Sivaprakasam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01552
Pdf URL: https://arxiv.org/pdf/2603.01552
Copy Paste: [[2603.01552]] Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder(https://arxiv.org/abs/2603.01552)
Keywords: generation, generative
Abstract: Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer's. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, lacking in the existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer's disease progression.
摘要：基于生成人工智能框架的纵向人脑图像建模和预测提供了一种跟踪神经退行性进展的有效机制，这对于评估阿尔茨海默病等疾病至关重要。在现有的生成方法中，最近基于扩散的模型已成为生成疾病进展图像的有效替代方法。将多模态和非成像属性作为条件信息纳入扩散框架已被证明可以提高此类生成过程中的可控性。然而，现有方法并没有明确确保来自非成像调节模式的信息与图像特征有意义地对齐，以在生成的图像中引入所需的变化，例如对进展特定区域的调制。此外，通过将与进展相关的结构引入到模型的内部表示中，可以实现对生成过程的更精确的控制，这是现有方法所缺乏的。为了解决这些限制，我们提出了一种基于扩散自动编码器的疾病进展建模框架，该框架明确地强制不同模式之间的一致性。通过引入明确的目标函数来强制对齐，该目标函数使模型能够专注于表现出进展相关变化的区域。此外，我们设计了一种机制来更好地构建扩散自动编码框架的潜在表示空间。具体来说，我们分配单独的潜在子空间来整合进展相关条件并保留特定于受试者的身份信息，从而更好地控制图像生成。这些结果表明，强制对齐和更好地构建扩散自动编码框架的潜在表示空间可以使阿尔茨海默氏病进展的模型在解剖学上更加精确。

Title: LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01563
Pdf URL: https://arxiv.org/pdf/2603.01563
Copy Paste: [[2603.01563]] LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models(https://arxiv.org/abs/2603.01563)
Keywords: generation
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
摘要：具有可验证奖励的强化学习（RLVR）在改进自回归模型方面取得了显着的成功，特别是在数学推理和代码生成等需要正确性的领域。然而，直接将此类范式应用于扩散大语言模型（dLLM）从根本上受到精确似然计算的棘手性的阻碍，这迫使现有方法依赖于高方差近似。为了弥补这一差距，我们提出了无似然策略优化（LFPO），这是一个将向量场流匹配的概念映射到离散标记空间的原生框架。具体来说，LFPO 将对齐公式化为几何速度校正，通过对比更新直接优化去噪 logits。这种设计有效地绕过了似然近似中固有的误差，产生了精确的梯度估计。此外，LFPO 通过从中间步骤预测最终解决方案来强制一致性，有效地理顺概率流，以显着减少迭代次数实现高质量生成。大量实验表明，LFPO 不仅在代码和推理基准方面优于最先进的基线，而且还通过减少扩散步骤将推理速度提高了约 20%。

Title: SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis

Authors: Chuqiao Wu, Jin Song, Yiyun Fei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01579
Pdf URL: https://arxiv.org/pdf/2603.01579
Copy Paste: [[2603.01579]] SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis(https://arxiv.org/abs/2603.01579)
Keywords: generative
Abstract: Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
摘要：对于当前的生成模型来说，在现有场景中生成逼真且结构合理的人类图像仍然是一个重大挑战，这些模型经常会产生扭曲的四肢和不自然的姿势等伪影。我们将这种系统性失败归因于无法对人类骨骼结构进行明确的推理。为了解决这个问题，我们引入了 SkeleGuide，这是一个基于显式骨骼推理的新颖框架。通过推理和渲染阶段的联合训练，SkeleGuide 学会生成一个内部姿势，作为强大的结构先验，指导合成实现高度结构完整性。为了实现细粒度的用户控制，我们引入了 PoseInverter，该模块可将内部潜在姿势解码为显式且可编辑的格式。大量实验表明，SkeleGuide 在生成高保真、上下文感知的人体图像方面明显优于专用模型和通用模型。我们的工作提供了令人信服的证据，表明显式建模骨骼结构是实现稳健且合理的人类图像合成的基本步骤。

Title: Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference

Authors: Jiaqi Leng, Shuyuan Tu, Haidong Cao, Sicheng Xie, Daoguo Dong, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01594
Pdf URL: https://arxiv.org/pdf/2603.01594
Copy Paste: [[2603.01594]] Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference(https://arxiv.org/abs/2603.01594)
Keywords: generation
Abstract: Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that a similar issue occurs with naive classifier guidance in conditional diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under the score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
摘要：人类偏好对齐对文本到 3D 生成中的扩散模型提出了一个关键但尚未充分探索的挑战。现有的解决方案通常需要针对特定任务进行微调，这在数据稀缺的 3D 领域构成了重大障碍。为了解决这个问题，我们提出了偏好分数蒸馏（PSD），这是一种基于优化的框架，利用预训练的 2D 奖励模型进行人类对齐的文本到 3D 合成，而无需 3D 训练数据。我们的关键见解源于像素级梯度的不兼容性：由于奖励模型训练期间缺乏噪声样本，直接应用 2D 奖励梯度会扰乱去噪过程。注意到条件扩散模型中的朴素分类器指导中也出现类似的问题，我们从根本上重新考虑通过隐式奖励模型将偏好对齐作为无分类器指导（CFG）式机制。此外，认识到冻结的预训练扩散模型会限制性能，我们引入了一种自适应策略来共同优化偏好分数和负文本嵌入。通过在优化期间合并 CFG，负文本嵌入的在线细化可动态增强对齐。据我们所知，我们是第一个在分数蒸馏框架下将人类偏好与 CFG 理论联系起来的人。实验证明了PSD在美学指标、与多种管道的无缝集成以及强大的可扩展性方面的优越性。

Title: Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement

Authors: Xiwen Wang, Shichao Zhang, Hailun Zhang, Ruowei Wang, Mao Li, Chenyu Zhou, Qijun Zhao, Ji-Zhe Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01601
Pdf URL: https://arxiv.org/pdf/2603.01601
Copy Paste: [[2603.01601]] Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement(https://arxiv.org/abs/2603.01601)
Keywords: generation
Abstract: Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations mainly arise because existing methods reconstruct 3D content from sparsely generated multi-view images, which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
摘要：大型 3D 重建模型彻底改变了 3D 内容生成领域，在虚拟现实和游戏中实现了广泛的应用。就像其他大型模型一样，大型 3D 重建模型也会出现幻觉，引入偏离输入数据的结构异常值（例如奇怪的孔或突起）。然而，与其他大型模型不同，大型 3D 重建模型中的幻觉仍然没有得到严重的探索，导致 3D 打印的物体畸形或虚拟场景的沉浸感不足。这种幻觉主要源于现有方法从稀疏生成的多视图图像中重建 3D 内容，这些图像存在较大的视点间隙和不连续性。为了通过消除异常值来减轻幻觉，我们建议使用 Dehallu3D 来生成 3D 网格。我们的关键思想是设计一个平衡的多视图连续性约束，以强制在密集的中间视点之间进行平滑过渡，同时避免过度平滑，从而消除尖锐的几何特征。因此，Dehallu3D 采用即插即用优化模块，具有两个关键约束：(i) 相邻一致性以确保跨视图的几何连续性，以及 (ii) 自适应平滑度以很好地保留此 http URL，进一步提出离群风险度量 (ORM) 指标，以从离群值的角度量化 3D 生成中的几何保真度。大量实验表明，Dehallu3D 通过有效保留结构细节同时去除幻觉异常值，实现了高保真 3D 生成。

Title: Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Authors: Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01623
Pdf URL: https://arxiv.org/pdf/2603.01623
Copy Paste: [[2603.01623]] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration(https://arxiv.org/abs/2603.01623)
Keywords: generation
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79$\times$ speedup on FLUX.1 and 4.67$\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
摘要：扩散模型已成为高保真图像和视频生成的主要工具，但由于扩散变压器的大量迭代，其推理速度受到严重瓶颈。为了减少详尽的计算，最近的工作采用了特征缓存和重用方案，该方案通过使用先前步骤中的缓存特征来跳过选定扩散步骤的网络评估。然而，他们的初步设计仅依赖于局部近似，导致误差随着较大的跳跃而快速增长，并导致高速加速时样本质量下降。在这项工作中，我们提出了谱扩散特征预测器（Spectrum），这是一种免训练的方法，可以在严格控制误差的情况下实现全局、远程特征重用。特别是，我们将降噪器的潜在特征视为随时间变化的函数，并用切比雪夫多项式对其进行近似。具体来说，我们通过岭回归拟合每个基的系数，然后利用该系数来预测未来多个扩散步骤的特征。我们从理论上揭示，我们的方法允许更有利的长视野行为，并产生不与步长复合的误差界限。对各种最先进的图像和视频扩散模型的广泛实验一致验证了我们方法的优越性。值得注意的是，我们在 FLUX.1 上实现了高达 4.79$\times$ 的加速，在 Wan2.1-14B 上实现了 4.67$\times$ 的加速，同时与基线相比保持了更高的样本质量。
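
The forecasting idea in the Spectrum abstract (approximate a feature as a function of time with Chebyshev polynomials, fit the coefficients by ridge regression, then evaluate at future steps) can be sketched as follows. This is a toy version under stated assumptions, not the authors' implementation: it handles a single scalar feature instead of a tensor, assumes diffusion time has already been rescaled to [-1, 1], and uses hypothetical helper names (`chebyshev_basis`, `fit_ridge`, `forecast`) and an arbitrary regularization strength.

```python
def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1} = 2x T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_ridge(xs, ys, degree, lam=1e-9):
    """Ridge regression: solve (Phi^T Phi + lam I) c = Phi^T y."""
    phi = [chebyshev_basis(x, degree) for x in xs]
    m = degree + 1
    gram = [[sum(r[i] * r[j] for r in phi) + (lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    rhs = [sum(r[i] * y for r, y in zip(phi, ys)) for i in range(m)]
    return solve_linear_system(gram, rhs)


def forecast(coeffs, x):
    """Evaluate the fitted Chebyshev expansion at a (possibly future) time x."""
    basis = chebyshev_basis(x, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))


def true_feature(x):
    # Toy ground truth with Chebyshev coefficients [0.5, -0.3, 0.2].
    return 0.5 - 0.3 * x + 0.2 * (2.0 * x * x - 1.0)


observed_times = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0]   # steps already computed
cached = [true_feature(t) for t in observed_times]     # "cached" feature values
coeffs = fit_ridge(observed_times, cached, degree=2)
predicted = forecast(coeffs, 0.6)                      # a skipped future step
```

Because the basis is global over the whole time interval, one fit can serve several future steps, which is the contrast with purely local reuse that the abstract draws.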

Title: DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving

Authors: Enhui Ma, Jiahuan Zhang, Guantian Zheng, Tao Tang, Shengbo Eben Li, Yuhang Lu, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Zhihui Hao, Xianpeng Lang, Kaicheng Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01637
Pdf URL: https://arxiv.org/pdf/2603.01637
Copy Paste: [[2603.01637]] DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving(https://arxiv.org/abs/2603.01637)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in complex real-world situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
摘要：多模态大型语言模型 (MLLM) 正在迅速成为端到端自动驾驶系统的智能大脑。一个关键的挑战是评估 MLLM 是否能够真正理解并遵循复杂的现实世界交通规则。然而，现有的基准测试主要关注交通标志识别等单规则场景，忽略了实际驾驶中多规则并发和冲突的复杂性。因此，模型在简单任务上表现良好，但在现实世界的复杂情况下经常失败或违反规则。为了弥补这一差距，我们提出了 DriveCombo，这是一种基于文本和视觉的组合交通规则推理基准。受人类驾驶员认知发展的启发，我们提出了一个系统的五级认知阶梯，评估从单规则理解到多规则整合和冲突解决的推理，从而实现跨认知阶段的定量评估。我们进一步提出了一种 Rule2Scene Agent，通过规则制定和场景生成将基于语言的交通规则映射到动态驾驶场景，从而实现场景级交通规则视觉推理。对 14 个主流 MLLM 的评估表明，随着任务复杂性的增加，尤其是在规则冲突期间，性能会下降。在分割数据集并对训练集进行微调后，我们进一步观察到交通规则推理和下游规划能力的显着改进。这些结果凸显了 DriveCombo 在推进合规和智能自动驾驶系统方面的有效性。

Title: QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image

Authors: Rundong Wang, Wei Ba, Ying Zhou, Yingtai Li, Bowen Liu, Baizhi Wang, Yuhao Wang, Zhidong Yang, Kun Zhang, Rui Yan, S. Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01647
Pdf URL: https://arxiv.org/pdf/2603.01647
Copy Paste: [[2603.01647]] QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image(https://arxiv.org/abs/2603.01647)
Keywords: generation
Abstract: Recent methods for pathology report generation from whole-slide images (WSIs) are capable of producing slide-level diagnostic descriptions but fail to ground fine-grained statements in localized visual evidence. Furthermore, they lack control over which diagnostic details to include and how to verify them. Inspired by emerging agentic analysis paradigms and the diagnostic workflow of pathologists, who selectively examine multiple fields of view, we propose QCAgent, an agentic framework for quality-controllable WSI report generation. The core innovations of this framework are as follows: (i) it incorporates a customized critique mechanism guided by a user-defined checklist specifying required diagnostic details and constraints; (ii) it re-identifies informative regions in the WSI based on the critique feedback and text-patch semantic retrieval, a process that iteratively enriches and reconciles the report. Experiments demonstrate that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSIs.
摘要：最近从全幻灯片图像（WSI）生成病理报告的方法能够产生幻灯片级别的诊断描述，但无法在局部视觉证据中提供细粒度的陈述。此外，他们无法控制要包含哪些诊断详细信息以及如何验证它们。受新兴的代理分析范式和病理学家的诊断工作流程的启发，病理学家选择性地检查多个视野，我们提出了 QCAgent，一种用于质量可控的 WSI 报告生成的代理框架。该框架的核心创新如下：（i）它采用了定制的批评机制，该机制由用户定义的清单指导，指定所需的诊断细节和约束； (ii) 它根据批评反馈和文本补丁语义检索重新识别 WSI 中的信息区域，这是一个迭代丰富和协调报告的过程。实验表明，通过使报告要求明确提示定义、约束感知并通过基于证据的细化进行验证，QCAgent 能够从 WSI 可控地生成具有临床意义和高覆盖率的病理报告。

Title: Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling

Authors: Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2603.01655
Pdf URL: https://arxiv.org/pdf/2603.01655
Copy Paste: [[2603.01655]] Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling(https://arxiv.org/abs/2603.01655)
Keywords: generative
Abstract: Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the power of the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics to reduce the number of path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a comprehensive machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying such generative models to this domain presents significant challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key architectural components. First, we implement an experience replay buffer to capture and retain rare valid paths. Second, we adopt a uniform exploratory policy to improve generalization and prevent the model from overfitting to simple geometries. Third, we apply a physics-based action masking strategy that filters out physically impossible paths before the model even considers them. As demonstrated in our experimental validation, the proposed model achieves substantial speedups over exhaustive search (up to $10\times$ faster on GPU and $1000\times$ faster on CPU) while maintaining high coverage accuracy and successfully uncovering complex propagation paths. The complete source code, tests, and tutorial are available at this https URL.
摘要：光线追踪已成为精确无线电传播建模的标准，但其计算复杂性呈指数级增长，因为候选路径的数量随着对象数量的交互阶次幂而变化。这一瓶颈限制了其在大规模或实时应用中的使用，迫使传统工具依靠启发式方法来减少候选路径的数量，但代价可能是降低准确性。为了克服这一限制，我们提出了一个全面的机器学习辅助框架，通过生成流网络用智能采样取代详尽的路径搜索。将此类生成模型应用于该领域面临着巨大的挑战，特别是由于有效路径的稀有而导致的奖励稀疏，这可能会导致在复杂环境中评估高阶交互时收敛失败和琐碎的解决方案。为了确保稳健的学习和高效的探索，我们的框架包含三个关键的架构组件。首先，我们实现一个 \emph{experience replay buffer} 来捕获并保留罕见的有效路径。其次，我们采用统一的探索策略来提高泛化能力并防止模型过度拟合简单的几何形状。第三，我们应用基于物理的动作屏蔽策略，在模型考虑物理上不可能的路径之前过滤掉它们。正如我们的实验验证所证明的，所提出的模型比穷举搜索实现了大幅加速——在 GPU 上速度提高了 10 倍，在 CPU 上速度提高了 1000 倍——同时保持了高覆盖精度并成功发现了复杂的传播路径。完整的源代码、测试和教程可从此 https URL 获取。
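
The scaling argument in the ray-path abstract (candidate paths grow as the number of objects raised to the interaction order) and the effect of masking can be made concrete with a small enumeration. The masking rule used here, forbidding two consecutive interactions with the same object, is a hypothetical example chosen for illustration; the abstract does not say which physical rules the authors apply.

```python
from itertools import product


def candidate_paths(num_objects, order):
    """Enumerate every ordered sequence of `order` interactions among the objects."""
    return list(product(range(num_objects), repeat=order))


def is_physically_plausible(path):
    # Hypothetical mask: a ray may not interact with the same object twice in a row.
    return all(a != b for a, b in zip(path, path[1:]))


def masked_paths(num_objects, order):
    """Keep only the candidates that pass the plausibility mask."""
    return [p for p in candidate_paths(num_objects, order) if is_physically_plausible(p)]


num_objects, order = 4, 3
all_paths = candidate_paths(num_objects, order)
kept = masked_paths(num_objects, order)
print(len(all_paths), len(kept))  # prints "64 36": 4**3 candidates, 4 * 3**2 after masking
```

Even this simple mask only lowers the base of the exponent, which is why the abstract pairs masking with learned sampling instead of relying on exhaustive enumeration.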

Title: FreeGNN: Continual Source-Free Graph Neural Network Adaptation for Renewable Energy Forecasting

Authors: Abderaouf Bahi, Amel Ourici, Ibtissem Gasmi, Aida Derrablia, Warda Deghmane, Mohamed Amine Ferrag
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01657
Pdf URL: https://arxiv.org/pdf/2603.01657
Copy Paste: [[2603.01657]] FreeGNN: Continual Source-Free Graph Neural Network Adaptation for Renewable Energy Forecasting(https://arxiv.org/abs/2603.01657)
Keywords: generation
Abstract: Accurate forecasting of renewable energy generation is essential for efficient grid management and sustainable power planning. However, traditional supervised models often require access to labeled data from the target site, which may be unavailable due to privacy, cost, or logistical constraints. In this work, we propose FreeGNN, a Continual Source-Free Graph Domain Adaptation framework that enables adaptive forecasting on unseen renewable energy sites without requiring source data or target labels. Our approach integrates a spatio-temporal Graph Neural Network (GNN) backbone with a teacher-student strategy, a memory replay mechanism to mitigate catastrophic forgetting, graph-based regularization to preserve spatial correlations, and a drift-aware weighting scheme to dynamically adjust adaptation strength during streaming updates. This combination allows the model to continuously adapt to non-stationary environmental conditions while maintaining robustness and stability. We conduct extensive experiments on three real-world datasets: GEFCom2012, Solar PV, and Wind SCADA, encompassing multiple sites, temporal resolutions, and meteorological features. The ablation study confirms that each component (memory replay, graph regularization, drift-aware adaptation, and the teacher-student strategy) contributes significantly to overall performance. The experiments show that FreeGNN achieves an MAE of 5.237 and an RMSE of 7.123 on the GEFCom dataset, an MAE of 1.107 and an RMSE of 1.512 on the Solar PV dataset, and an MAE of 0.382 and an RMSE of 0.523 on the Wind SCADA dataset. These results demonstrate its ability to achieve accurate and robust forecasts in a source-free, continual learning setting, highlighting its potential for real-world deployment in adaptive renewable energy systems. For reproducibility, implementation details are available at: this https URL.
摘要：准确预测可再生能源发电对于高效的电网管理和可持续电力规划至关重要。然而，传统的监督模型通常需要访问目标站点的标记数据，而由于隐私、成本或后勤限制，这些数据可能无法获得。在这项工作中，我们提出了 FreeGNN，这是一种持续无源图域适应框架，可以在不需要源数据或目标标签的情况下对看不见的可再生能源站点进行自适应预测。我们的方法将时空图神经网络（GNN）主干与师生策略、用于减轻灾难性遗忘的记忆重放机制、用于保留空间相关性的基于图的正则化以及用于在流更新期间动态调整适应强度的漂移感知加权方案集成在一起。这种组合使模型能够不断适应非平稳环境条件，同时保持鲁棒性和稳定性。我们对三个真实世界数据集进行了广泛的实验：GEFCom2012、太阳能光伏和风能 SCADA，涵盖多个站点、时间分辨率和气象特征。消融研究证实，每个组件的记忆、图正则化、漂移感知适应和师生策略对整体表现有显着贡献。实验表明，FreeGNN 在 GEFCom 数据集上的 MAE 为 5.237，RMSE 为 7.123；在 Solar PV 数据集上的 MAE 为 1.107，RMSE 为 1.512；在 Wind SCADA 数据集上的 MAE 为 0.382，RMSE 为 0.523。这些结果证明了其在无源、持续学习环境中实现准确和稳健预测的能力，突显了其在自适应可再生能源系统中实际部署的潜力。为了重现性，实现细节可在以下网址获得：此 https URL。
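
For readers comparing the MAE and RMSE figures quoted in the FreeGNN abstract, the two metrics can be computed as below. This is only the standard definition of each metric applied to made-up numbers; the toy values are not from the paper, and the paper's exact evaluation protocol (normalization, averaging over sites) is not stated in the abstract.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |target - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy forecast with a single large miss on the last point.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred))  # prints "0.5 1.0"
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are large, which is consistent with every RMSE in the abstract exceeding the matching MAE.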

Title: A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs

Authors: Aryan Goyal, Shreshtha Singh, Ashish Mittal, Manoj Tadepalli, Piyush Kumar, Preetham Putha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01659
Pdf URL: https://arxiv.org/pdf/2603.01659
Copy Paste: [[2603.01659]] A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs(https://arxiv.org/abs/2603.01659)
Keywords: generation
Abstract: Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule mask conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues, overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
摘要：胸部X光片（CXR）中肺癌的早期检测对于改善患者的治疗效果至关重要，但由于其微妙的外观以及尺寸、纹理和边界等放射学特征的可变性，结节的检测仍然具有挑战性。为了进行稳健的分析，这种多样性必须在基于深度学习的计算机辅助诊断 (CAD) 系统的训练数据集中得到很好的体现。然而，组装此类数据集成本高昂且通常不切实际，这就激发了对实际合成数据生成的需求。现有方法缺乏对合成结核生成的细粒度控制，限制了它们在解决数据稀缺方面的效用。本文提出了一种新颖的基于扩散的框架，具有低秩适应（LoRA）适配器，用于 CXR 上的特征控制结节合成。我们首先通过基础扩散模型的结节掩模条件训练来解决尺寸和形状控制问题。为了实现个体特征控制，我们训练单独的 LoRA 模块，每个模块专用于特定的放射学特征。然而，由于结节很少表现出孤立的特征，有效的多特征控制需要特征的平衡集成。我们通过利用 LoRA 的动态可组合性并重新审视现有的合并策略来解决这个问题。在此基础上，我们确定了两个关键问题，重叠的注意力区域和非正交参数空间。为了克服这些限制，我们在 LoRA 组合训练期间引入了一种新的正交性损失项。对内部和公共数据集的广泛实验证明了下游结核检测的改进。放射科医生的评估证实了我们生成的结节的细粒度可控性，并且在多个定量指标中，我们的方法超越了现有的 CXR 结节生成方法。
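
The abstract names an orthogonality loss for composing LoRA modules but does not give its form. As a generic illustration only, and not the paper's loss, one common way to penalize overlap between two low-rank adapters is the squared Frobenius norm of the product of their down-projection matrices. The matrix names and toy values below are hypothetical.

```python
def matmul_transpose(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def orthogonality_penalty(a1, a2):
    """Squared Frobenius norm of a1 @ a2^T.

    It is zero exactly when every row of a1 is orthogonal to every row of a2,
    i.e. when the two adapters read from orthogonal input subspaces.
    """
    overlap = matmul_transpose(a1, a2)
    return sum(v * v for row in overlap for v in row)


# Hypothetical rank-2 down-projections for two characteristics (e.g. size, texture).
lora_size = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
lora_texture = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
lora_overlapping = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

print(orthogonality_penalty(lora_size, lora_texture))      # prints "0.0"
print(orthogonality_penalty(lora_size, lora_overlapping))  # prints "1.0"
```

In training, a term like this would be added to the main objective with a weighting coefficient so that gradient descent pushes the adapters toward non-overlapping subspaces.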

Title: FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

Authors: Shao Shitong, Gu Yufei, Xie Zeke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01685
Pdf URL: https://arxiv.org/pdf/2603.01685
Copy Paste: [[2603.01685]] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters(https://arxiv.org/abs/2603.01685)
Keywords: generation, generative
Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state of the art in efficient video generation.

Title: DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs

Authors: Aryan Goyal, Ashish Mittal, Pranav Rao, Manoj Tadepalli, Preetham Putha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01686
Pdf URL: https://arxiv.org/pdf/2603.01686
Copy Paste: [[2603.01686]] DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs(https://arxiv.org/abs/2603.01686)
Keywords: restoration, generative
Abstract: Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Using digitally reconstructed radiographs (DRRs) derived from CT scans to generate synthetic frontal chest X-rays with artificially inserted lung nodules offers one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process. First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs; this addresses the challenge of training data scarcity by posing it as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.

Title: Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration

Authors: Guanglu Dong, Chunlei Li, Chao Ren, Jingliang Hu, Yilei Shi, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01725
Pdf URL: https://arxiv.org/pdf/2603.01725
Copy Paste: [[2603.01725]] Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration(https://arxiv.org/abs/2603.01725)
Keywords: restoration
Abstract: Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scene, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. Code is available at this https URL.

Title: NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation

Authors: Rong Fu, Yiqing Lyu, Chunlei Meng, Muge Qi, Yabin Jin, Qi Zhao, Li Bao, Juntao Gao, Fuqian Shi, Nilanjan Dey, Wei Luo, Simon Fong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01756
Pdf URL: https://arxiv.org/pdf/2603.01756
Copy Paste: [[2603.01756]] NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation(https://arxiv.org/abs/2603.01756)
Keywords: generation
Abstract: Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.

Title: StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models

Authors: Keli Liu, Zhendong Wang, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01757
Pdf URL: https://arxiv.org/pdf/2603.01757
Copy Paste: [[2603.01757]] StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models(https://arxiv.org/abs/2603.01757)
Keywords: generation
Abstract: Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
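The dual-criterion idea in the StepVAR abstract (a high-pass response for texture, PCA for structure) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 3x3 mean-subtraction filter, the number of principal components, the min-max normalization, the equal weighting, and the function names `texture_score`, `structure_score`, and `select_tokens` are hypothetical and are not taken from the paper.

```python
import numpy as np

def texture_score(feat):
    """High-pass energy per token: distance from the 3x3 neighborhood mean.

    feat has shape (H, W, C). The box filter is an illustrative choice.
    """
    H, W, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    local_mean = sum(
        padded[i:i + H, j:j + W] for i in range(3) for j in range(3)
    ) / 9.0
    return np.linalg.norm(feat - local_mean, axis=-1)  # shape (H, W)

def structure_score(feat, n_components=4):
    """Norm of each token's projection onto the top principal components."""
    H, W, C = feat.shape
    X = feat.reshape(-1, C)
    X = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered tokens.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:n_components].T
    return np.linalg.norm(proj, axis=1).reshape(H, W)

def select_tokens(feat, keep_ratio=0.5, weight=0.5):
    """Return sorted flat indices of the tokens kept after pruning."""
    def minmax(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-8)

    score = (weight * minmax(texture_score(feat))
             + (1.0 - weight) * minmax(structure_score(feat)))
    k = max(1, int(round(keep_ratio * score.size)))
    keep = np.argsort(score.ravel())[::-1][:k]
    return np.sort(keep)
```

For example, `select_tokens(feat, keep_ratio=0.25)` on an 8x8 feature map keeps 16 token indices. Reconstructing a dense map from the kept tokens (the nearest neighbor propagation step mentioned in the abstract) is not covered by this sketch.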

Title: CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning

Authors: Pratik Jawahar, Maurizio Pierini
Subjects: cs.LG, cs.AI, physics.app-ph
Abstract URL: https://arxiv.org/abs/2603.01768
Pdf URL: https://arxiv.org/pdf/2603.01768
Copy Paste: [[2603.01768]] CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning(https://arxiv.org/abs/2603.01768)
Keywords: generative
Abstract: Current deep learning primitives dealing with temporal dynamics suffer from a fundamental dichotomy: they are either discrete and unstable (LSTMs) \citep{pascanu_difficulty_2013}, leading to exploding or vanishing gradients; or they are continuous and dissipative (Neural ODEs) \citep{dupont_augmented_2019}, which destroy information over time to ensure stability. We propose the Causal Hamiltonian Learning Unit (pronounced: clue), a novel physics-grounded computational learning primitive. By enforcing a Relativistic Hamiltonian structure and utilizing symplectic integration, a CHLU strictly conserves phase-space volume, as an attempt to solve the memory-stability trade-off. We show that the CHLU is designed for infinite-horizon stability, as well as controllable noise filtering. We then demonstrate a CHLU's generative ability using the MNIST dataset as a proof-of-principle.
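To make "symplectic integration" of a relativistic Hamiltonian concrete, here is a minimal sketch of a leapfrog (kick-drift-kick) step. It is not the CHLU architecture: the one-dimensional setting, the Hamiltonian H(q, p) = sqrt(p^2 + m^2) + k q^2 / 2, the harmonic potential, and the function names are all assumptions chosen for illustration.

```python
import math

def hamiltonian(q, p, m=1.0, k=1.0):
    # Relativistic kinetic term plus a harmonic potential
    # (the potential is an illustrative assumption).
    return math.sqrt(p * p + m * m) + 0.5 * k * q * q

def leapfrog_step(q, p, dt, m=1.0, k=1.0):
    """One kick-drift-kick step for the separable Hamiltonian above."""
    p_half = p - 0.5 * dt * k * q                       # half kick: dp/dt = -dV/dq
    q_new = q + dt * p_half / math.sqrt(p_half * p_half + m * m)  # drift: dq/dt = dH/dp
    p_new = p_half - 0.5 * dt * k * q_new               # second half kick
    return q_new, p_new

def rollout(q, p, dt, steps, m=1.0, k=1.0):
    """Integrate for `steps` steps and record the energy after each one."""
    energies = [hamiltonian(q, p, m, k)]
    for _ in range(steps):
        q, p = leapfrog_step(q, p, dt, m, k)
        energies.append(hamiltonian(q, p, m, k))
    return q, p, energies
```

Because each kick and drift is a shear map, the composed step preserves phase-space area, and the scheme is time-reversible: stepping forward and then stepping again with the momentum negated returns to the starting state. In practice the energy error stays bounded over long rollouts rather than drifting.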

Title: D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

Authors: Zhao Yang, Hengchang Liu, Chuan Cao, Bing Su
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2603.01780
Pdf URL: https://arxiv.org/pdf/2603.01780
Copy Paste: [[2603.01780]] D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation(https://arxiv.org/abs/2603.01780)
Keywords: generation, generative
Abstract: Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at this https URL.
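The forward (corruption) side of masked diffusion over a discrete sequence can be sketched in a few lines. This is a generic absorbing-state formulation, not D3LM's implementation: the single-nucleotide tokenization, the `[MASK]` symbol, the independent per-token masking with probability `t`, and the function names are assumptions for illustration (the abstract notes that tokenization and sampling choices are themselves studied in the paper).

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def mask_sequence(tokens, t, rng):
    """Replace each token with MASK independently with probability t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_positions(noisy):
    """Indices a denoiser would be trained to reconstruct."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]
```

A denoiser trained on such pairs sees unmasked context on both sides of every masked position, which is what makes the objective bidirectional; at `t = 1` the whole sequence is masked, which corresponds to the starting point for generation.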

Title: Learning Shortest Paths with Generative Flow Networks

Authors: Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01786
Pdf URL: https://arxiv.org/pdf/2603.01786
Copy Paste: [[2603.01786]] Learning Shortest Paths with Generative Flow Networks(https://arxiv.org/abs/2603.01786)
Keywords: generative
Abstract: In this paper, we present a novel learning framework for finding shortest paths in graphs utilizing Generative Flow Networks (GFlowNets). First, we examine theoretical properties of GFlowNets in non-acyclic environments in relation to shortest paths. We prove that, if the total flow is minimized, forward and backward policies traverse the environment graph exclusively along shortest paths between the initial and terminal states. Building on this result, we show that the pathfinding problem in an arbitrary graph can be solved by training a non-acyclic GFlowNet with flow regularization. We experimentally demonstrate the performance of our method in pathfinding in permutation environments and in solving Rubik's Cubes. For the latter problem, our approach is competitive in solution length with state-of-the-art machine learning approaches designed specifically for this task, while requiring a smaller search budget at test time.

Title: Phase-Type Variational Autoencoders for Heavy-Tailed Data

Authors: Abdelhakim Ziani, András Horváth, Paolo Ballarini
Subjects: cs.LG, cs.AI, stat.ML, stat.OT
Abstract URL: https://arxiv.org/abs/2603.01800
Pdf URL: https://arxiv.org/pdf/2603.01800
Copy Paste: [[2603.01800]] Phase-Type Variational Autoencoders for Heavy-Tailed Data(https://arxiv.org/abs/2603.01800)
Keywords: generative
Abstract: Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
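The definition used in the PH-VAE abstract, a Phase-Type distribution as the absorption time of a continuous-time Markov chain, can be made concrete with a small sketch. It covers only the distribution, not the VAE: the initial distribution `alpha`, the sub-generator matrix `S` over transient states, and the function names are illustrative assumptions. The mean uses the standard identity E[T] = -alpha S^{-1} 1.

```python
import numpy as np

def ph_mean(alpha, S):
    """Mean absorption time: E[T] = -alpha @ S^{-1} @ 1."""
    alpha = np.asarray(alpha, dtype=float)
    S = np.asarray(S, dtype=float)
    return float(-alpha @ np.linalg.solve(S, np.ones(S.shape[0])))

def sample_ph(alpha, S, rng):
    """Simulate the CTMC until absorption and return the elapsed time."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    exit_rates = -S.sum(axis=1)          # rate of jumping to the absorbing state
    state = rng.choice(n, p=alpha)
    t = 0.0
    while True:
        rate = -S[state, state]          # total rate of leaving this state
        t += rng.exponential(1.0 / rate)
        # Jump probabilities: other transient states, then absorption (index n).
        probs = np.append(S[state], exit_rates[state])
        probs[state] = 0.0
        probs = np.clip(probs, 0.0, None)
        probs /= probs.sum()
        nxt = rng.choice(n + 1, p=probs)
        if nxt == n:
            return t
        state = nxt
```

As a usage example, `alpha = [1, 0]` with `S = [[-3, 3], [0, -3]]` is an Erlang distribution with two phases of rate 3, whose mean is 2/3. In a decoder of the kind the abstract describes, `alpha` and `S` would be produced from the latent code rather than fixed by hand.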

Title: Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments

Authors: Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01804
Pdf URL: https://arxiv.org/pdf/2603.01804
Copy Paste: [[2603.01804]] Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments(https://arxiv.org/abs/2603.01804)
Keywords: generative
Abstract: We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.

Title: Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes

Authors: Hongkun Dou, Zike Chen, Zeyu Li, Hongjue Li, Lijun Yang, Yue Deng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01837
Pdf URL: https://arxiv.org/pdf/2603.01837
Copy Paste: [[2603.01837]] Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes(https://arxiv.org/abs/2603.01837)
Keywords: generative
Abstract: Diffusion models have gained prominence as powerful generative tools for solving inverse problems due to their ability to model complex data distributions. However, existing methods typically rely on complete knowledge of the forward observation process to compute gradients for guided sampling, limiting their applicability in scenarios where such information is unavailable. In this work, we introduce Constrained Particle Seeking (CPS), a novel gradient-free approach that leverages all candidate particle information to actively search for the optimal particle while incorporating constraints aligned with high-density regions of the unconditional prior. Unlike previous methods that passively select promising candidates, CPS reformulates the inverse problem as a constrained optimization task, enabling more flexible and efficient particle seeking. We demonstrate that CPS can effectively solve both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives. Code is available at this https URL.

Title: FireRed-OCR Technical Report

Authors: Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2603.01840
Pdf URL: https://arxiv.org/pdf/2603.01840
Copy Paste: [[2603.01840]] FireRed-OCR Technical Report(https://arxiv.org/abs/2603.01840)
Keywords: generation
Abstract: We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
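The third training stage pairs a format check with GRPO's group-relative advantage, and that combination can be sketched briefly. The advantage normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form; the reward itself is a hypothetical stand-in, since the abstract only names "table closure, formula syntax" as examples. The tag and delimiter conventions and the function names below are assumptions.

```python
import statistics

def format_reward(markdown):
    """Hypothetical format check: 1.0 if tables and display math are closed."""
    tables_closed = markdown.count("<table>") == markdown.count("</table>")
    math_closed = markdown.count("$$") % 2 == 0
    return 1.0 if tables_closed and math_closed else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Given several sampled outputs for the same page, those that pass the check receive positive advantages and those that fail receive negative ones; if every sample scores the same, all advantages are zero and the group contributes no learning signal.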

Title: Tide: A Customisable Dataset Generator for Anti-Money Laundering Research

Authors: Montijn van den Beukel, Jože Martin Rožanec, Ana-Lucia Varbanescu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01863
Pdf URL: https://arxiv.org/pdf/2603.01863
Copy Paste: [[2603.01863]] Tide: A Customisable Dataset Generator for Anti-Money Laundering Research(https://arxiv.org/abs/2603.01863)
Keywords: generation
Abstract: The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.

Title: CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection

Authors: Yiheng Li, Zichang Tan, Guoqing Xu, Yijun Ye, Yang Yang, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01878
Pdf URL: https://arxiv.org/pdf/2603.01878
Copy Paste: [[2603.01878]] CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection(https://arxiv.org/abs/2603.01878)
Keywords: generative
Abstract: With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In this view, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.

Title: Generative Visual Chain-of-Thought for Image Editing

Authors: Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Bing Li, Zheng Chang, Kongming Liang, Qinglin Lu, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01893
Pdf URL: https://arxiv.org/pdf/2603.01893
Copy Paste: [[2603.01893]] Generative Visual Chain-of-Thought for Image Editing(https://arxiv.org/abs/2603.01893)
Keywords: generative
Abstract: Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This approach fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
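Abstract (translated from Chinese): Existing image editing methods struggle to perceive where to edit, especially in complex scenes and under nuanced spatial instructions. To address this, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike earlier text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes, end to end, the visual tokens generated during the reasoning and editing phases. This fosters the emergence of innate spatial reasoning ability and makes more effective use of visual-domain cues. The main challenge in training GVCoT is the scarcity of large-scale editing data with precise edit-region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8 million high-quality samples covering 19 tasks. We adopt a progressive training strategy: supervised fine-tuning builds foundational localization ability in the reasoning trace before the final edit, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models on complex scenes and fine-grained referring expressions. Experiments show that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope GVCoT will inspire future research on interpretable and precise image editing.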

Title: LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

Authors: Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01928
Pdf URL: https://arxiv.org/pdf/2603.01928
Copy Paste: [[2603.01928]] LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving(https://arxiv.org/abs/2603.01928)
Keywords: generation
Abstract: While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. This is coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and is refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.
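Abstract (translated from Chinese): Although Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. The recent shift toward latent reasoning tries to bypass these bottlenecks by thinking in a continuous hidden space. Without explicit intermediate constraints, however, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework that moves the reasoning paradigm from discrete symbolic processing to a physically grounded latent spatio-temporal CoT. Through a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. This is combined with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and is refined through reinforcement learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets new records on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), and performs strongly in spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.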

Title: Dream2Learn: Structured Generative Dreaming for Continual Learning

Authors: Salvatore Calcagno, Matteo Pennisi, Federica Proietto Salanitri, Amelia Sorrenti, Simone Palazzo, Concetto Spampinato, Giovanni Bellitto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01935
Pdf URL: https://arxiv.org/pdf/2603.01935
Copy Paste: [[2603.01935]] Dream2Learn: Structured Generative Dreaming for Continual Learning(https://arxiv.org/abs/2603.01935)
Keywords: generative
Abstract: Continual learning requires balancing plasticity and stability while mitigating catastrophic forgetting. Inspired by human dreaming as a mechanism for internal simulation and knowledge restructuring, we introduce Dream2Learn (D2L), a framework in which a model autonomously generates structured synthetic experiences from its own internal representations and uses them for self-improvement. Rather than reconstructing past data as in generative replay, D2L enables a classifier to create novel, semantically distinct dreamed classes that are coherent with its learned knowledge yet do not correspond to previously observed data. These dreamed samples are produced by conditioning a frozen diffusion model through soft prompt optimization driven by the classifier itself. The generated data are not used to replace memory, but to expand and reorganize the representation space, effectively allowing the network to self-train on internally synthesized concepts. By integrating dreamed classes into continual training, D2L proactively structures latent features to support forward knowledge transfer and adaptation to future tasks. This prospective self-training mechanism mirrors the role of sleep in consolidating and reorganizing memory, turning internal simulations into a tool for improved generalization. Experiments on Mini-ImageNet, FG-ImageNet, and ImageNet-R demonstrate that D2L consistently outperforms strong rehearsal-based baselines and achieves positive forward transfer, confirming its ability to enhance adaptability through internally generated training signals.
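Abstract (translated from Chinese): Continual learning requires balancing plasticity and stability while mitigating catastrophic forgetting. Inspired by human dreaming as a mechanism for internal simulation and knowledge restructuring, we introduce Dream2Learn (D2L), a framework in which a model autonomously generates structured synthetic experiences from its own internal representations and uses them for self-improvement. Rather than reconstructing past data as in generative replay, D2L lets a classifier create novel, semantically distinct dreamed classes that are consistent with its learned knowledge but do not correspond to previously observed data. These dreamed samples are produced by conditioning a frozen diffusion model through soft prompt optimization driven by the classifier itself. The generated data are used not to replace memory but to expand and reorganize the representation space, effectively allowing the network to self-train on internally synthesized concepts. By integrating dreamed classes into continual training, D2L proactively structures latent features to support forward knowledge transfer and adaptation to future tasks. This prospective self-training mechanism mirrors the role of sleep in consolidating and reorganizing memory, turning internal simulation into a tool for better generalization. Experiments on Mini-ImageNet, FG-ImageNet, and ImageNet-R show that D2L consistently outperforms strong rehearsal-based baselines and achieves positive forward transfer, confirming its ability to enhance adaptability through internally generated training signals.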

Title: Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

Authors: Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie Dorr
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.01950
Pdf URL: https://arxiv.org/pdf/2603.01950
Copy Paste: [[2603.01950]] Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment(https://arxiv.org/abs/2603.01950)
Keywords: generative
Abstract: A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
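Abstract (translated from Chinese): A system that lets blind or visually impaired users access comics and manga would introduce a new storytelling medium to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, more attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize the hallucinations that emerge in this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance for future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.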

Title: CoVAE: correlated multimodal generative modeling

Authors: Federico Caretti, Guido Sanguinetti
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2603.01965
Pdf URL: https://arxiv.org/pdf/2603.01965
Copy Paste: [[2603.01965]] CoVAE: correlated multimodal generative modeling(https://arxiv.org/abs/2603.01965)
Keywords: generation, generative
Abstract: Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.
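Abstract (translated from Chinese): Multimodal Variational Autoencoders have become a popular tool for extracting effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic datasets, demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.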

Title: Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

Authors: Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01993
Pdf URL: https://arxiv.org/pdf/2603.01993
Copy Paste: [[2603.01993]] Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection(https://arxiv.org/abs/2603.01993)
Keywords: generative
Abstract: Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
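Abstract (translated from Chinese): Recent advances in generative AI have significantly increased the realism of multimodal media manipulation, posing substantial challenges for manipulation detection. Existing manipulation detection and grounding approaches focus mainly on manipulation-type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires explicit forensic reasoning rather than merely classifying a limited set of manipulation types, an approach that fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency through reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.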

Title: Mitigating topology biases in Graph Diffusion via Counterfactual Intervention

Authors: Wendi Wang, Jiaxi Yang, Yongkang Du, Lu Lin
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2603.02005
Pdf URL: https://arxiv.org/pdf/2603.02005
Copy Paste: [[2603.02005]] Mitigating topology biases in Graph Diffusion via Counterfactual Intervention(https://arxiv.org/abs/2603.02005)
Keywords: generation
Abstract: Graph diffusion models have gained significant attention in graph generation tasks, but they often inherit and amplify topology biases from sensitive attributes (e.g. gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation using diffusion models is limited to specific graph-based applications with complete labels or requires simultaneous updates for graph structure and node attributes, making them unsuitable for general usage. To relax these limitations by applying the debiasing method directly on graph topology, we propose Fair Graph Diffusion Model (FairGDiff), a counterfactual-based one-step solution that mitigates topology biases while balancing fairness and utility. In detail, we construct a causal model to capture the relationship between sensitive attributes, biased link formation, and the generated graph structure. By answering the counterfactual question "Would the graph structure change if the sensitive attribute were different?", we estimate an unbiased treatment and incorporate it into the diffusion process. FairGDiff integrates counterfactual learning into both forward diffusion and backward denoising, ensuring that the generated graphs are independent of sensitive attributes while preserving structural integrity. Extensive experiments on real-world datasets demonstrate that FairGDiff achieves a superior trade-off between fairness and utility, outperforming existing fair graph generation methods while maintaining scalability.
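Abstract (translated from Chinese): Graph diffusion models have attracted significant attention in graph generation tasks, but they often inherit and amplify topology biases arising from sensitive attributes (e.g., gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation with diffusion models is either limited to specific graph-based applications with complete labels or requires simultaneous updates of graph structure and node attributes, making it unsuitable for general use. To relax these limitations by applying debiasing directly to the graph topology, we propose the Fair Graph Diffusion Model (FairGDiff), a counterfactual-based one-step solution that mitigates topology biases while balancing fairness and utility. Specifically, we construct a causal model to capture the relationship between sensitive attributes, biased link formation, and the generated graph structure. By answering the counterfactual question "Would the graph structure change if the sensitive attribute were different?", we estimate an unbiased treatment and incorporate it into the diffusion process. FairGDiff integrates counterfactual learning into both forward diffusion and backward denoising, ensuring that the generated graphs are independent of sensitive attributes while preserving structural integrity. Extensive experiments on real-world datasets show that FairGDiff achieves a superior trade-off between fairness and utility, outperforming existing fair graph generation methods while remaining scalable.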

Title: Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families

Authors: Amir Asiaee, Samhita Pal
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.02010
Pdf URL: https://arxiv.org/pdf/2603.02010
Copy Paste: [[2603.02010]] Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families(https://arxiv.org/abs/2603.02010)
Keywords: generation
Abstract: Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.
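Abstract (translated from Chinese): Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to quantify uncertainty. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing that the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.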
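
Contributions (1) and (2) of this abstract can be illustrated with a minimal sketch. It is not the paper's recipe: it assumes a Gaussian model with known standard deviation `sigma`, a public sample size `n`, replace-one neighbouring datasets (L2 sensitivity `2 * clip` for the clipped sum), and the classic Gaussian-mechanism calibration, which is valid for `0 < eps < 1`. Clipping bias is ignored, and every function name here is hypothetical.

```python
import math
import numpy as np

def gaussian_noise_std(l2_sensitivity, eps, delta):
    # Classic (eps, delta)-DP Gaussian-mechanism calibration, valid for 0 < eps < 1.
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def dp_mean_wald(x, clip, eps, delta, sigma, rng, z=1.96):
    """Release a noisy clipped sum, then form a plug-in estimate and Wald interval.

    Assumptions (not from the abstract): known sigma, public n,
    replace-one neighbours, clipping bias ignored.
    """
    n = len(x)
    t = np.clip(x, -clip, clip).sum()              # clipped sufficient statistic
    noise_std = gaussian_noise_std(2.0 * clip, eps, delta)
    t_dp = t + rng.normal(0.0, noise_std)          # the only quantity released
    est = t_dp / n                                 # plug-in estimate of the mean
    # Sampling variance plus the variance added by the privacy noise.
    var = sigma**2 / n + noise_std**2 / n**2
    half = z * math.sqrt(var)
    return est, est - half, est + half, noise_std

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=2000)
est, lo, hi, noise_std = dp_mean_wald(x, clip=5.0, eps=0.5, delta=1e-5, sigma=1.0, rng=rng)
print(f"estimate={est:.3f}, interval=({lo:.3f}, {hi:.3f})")
```

The interval is wider than the non-private one because of the extra `noise_std**2 / n**2` term, which is the kind of explicit variance inflation the abstract refers to.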

Title: MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising

Authors: Peiyuan Jing, Chun-Wun Cheng, Liutao Yang, Zhenxuan Zhang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier A. Montoya-Zegarra
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02012
Pdf URL: https://arxiv.org/pdf/2603.02012
Copy Paste: [[2603.02012]] MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising(https://arxiv.org/abs/2603.02012)
Keywords: restoration
Abstract: Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
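Abstract (translated from Chinese): Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, but their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated through degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, compared with 3D DDPM, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012). The performance gains generalize across scanners, reaching 34.42 dB PSNR and 0.141 NMAE on the external cohort and outperforming all competing methods.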

Title: Latent attention on masked patches for flow reconstruction

Authors: Ben Eze, Luca Magri, Andrea Nóvoa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02028
Pdf URL: https://arxiv.org/pdf/2603.02028
Copy Paste: [[2603.02028]] Latent attention on masked patches for flow reconstruction(https://arxiv.org/abs/2603.02028)
Keywords: generation
Abstract: Vision transformers have demonstrated outstanding performance on image generation applications, but their adoption in scientific disciplines, like fluid dynamics, has been limited. We introduce the Latent Attention on Masked Patches (LAMP) model, an interpretable regression-based modified vision transformer designed for masked flow reconstruction. LAMP follows a three-fold strategy: (i) partition of each flow snapshot into patches, (ii) dimensionality reduction of each patch via patch-wise proper orthogonal decomposition, and (iii) reconstruction of the full field from a masked input using a single-layer transformer trained via closed-form linear regression. We test the method on two canonical 2D unsteady wakes: a wake past a bluff body, and a chaotic wake past a flat plate. We show that the LAMP accurately reconstructs the full flow field from a 90%-masked and noisy input, across signal-to-noise ratios between 10 and 30 dB. Incorporating nonlinear measurement states can reduce the prediction error by up to an order of magnitude. The learned attention matrix yields physically interpretable multi-fidelity optimal sensor-placement maps. The modularity of the framework enables nonlinear compression and deep attention blocks, thereby providing an efficient baseline for nonlinear and high-dimensional masked flow reconstruction.
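Abstract (translated from Chinese): Vision transformers have shown outstanding performance in image generation applications, but their adoption in scientific disciplines such as fluid dynamics has been limited. We introduce the Latent Attention on Masked Patches (LAMP) model, an interpretable, regression-based modified vision transformer designed for masked flow reconstruction. LAMP follows a three-fold strategy: (i) partitioning each flow snapshot into patches, (ii) reducing the dimensionality of each patch through patch-wise proper orthogonal decomposition, and (iii) reconstructing the full field from a masked input using a single-layer transformer trained by closed-form linear regression. We test the method on two canonical 2D unsteady wakes: a wake past a bluff body and a chaotic wake past a flat plate. We show that LAMP accurately reconstructs the full flow field from a 90%-masked, noisy input at signal-to-noise ratios between 10 and 30 dB. Incorporating nonlinear measurement states can reduce the prediction error by up to an order of magnitude. The learned attention matrix yields physically interpretable, multi-fidelity optimal sensor-placement maps. The modularity of the framework allows for nonlinear compression and deep attention blocks, providing an efficient baseline for nonlinear and high-dimensional masked flow reconstruction.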
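
The three steps listed in this abstract (patch partition, patch-wise POD, closed-form regression from masked input) can be sketched roughly as follows. This is an illustrative toy, not the LAMP model: a plain least-squares linear map from observed-patch latents to all latents stands in for the single-layer transformer, whose exact form the abstract does not give. The synthetic two-mode field, the patch size, the choice of observed patches, and all function names are assumptions, and the input is noiseless.

```python
import numpy as np

def to_patches(fields, p):
    # (T, H, W) -> (T, n_patches, p*p), non-overlapping p-by-p patches
    T, H, W = fields.shape
    x = fields.reshape(T, H // p, p, W // p, p).transpose(0, 1, 3, 2, 4)
    return x.reshape(T, -1, p * p)

def fit_patch_pod(patches, r):
    # One rank-r POD basis per patch, from the SVD of that patch's snapshot matrix
    bases = []
    for k in range(patches.shape[1]):
        _, _, vt = np.linalg.svd(patches[:, k, :], full_matrices=False)
        bases.append(vt[:r].T)                      # (p*p, r)
    return np.stack(bases)                          # (n_patches, p*p, r)

def encode(patches, bases):
    return np.einsum("tkd,kdr->tkr", patches, bases)

def decode(latents, bases):
    return np.einsum("tkr,kdr->tkd", latents, bases)

# Synthetic field built from two space-time modes (a stand-in for flow data).
t = np.linspace(0.0, 20.0, 200)
xs = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)
X, Y = np.meshgrid(xs, xs, indexing="ij")
fields = (np.sin(t)[:, None, None] * np.sin(X + 2 * Y)
          + np.cos(0.7 * t)[:, None, None] * np.cos(3 * X - Y))

patches = to_patches(fields, p=8)                   # 16 patches of 64 pixels
train, test = patches[:150], patches[150:]
bases = fit_patch_pod(train, r=2)
z_train = encode(train, bases)

observed = np.array([0, 9])                         # 2 of 16 patches kept (~88% masked)
A = z_train[:, observed, :].reshape(len(train), -1) # observed latents
B = z_train.reshape(len(train), -1)                 # all latents
W, *_ = np.linalg.lstsq(A, B, rcond=1e-10)          # closed-form least squares

z_obs = encode(test, bases)[:, observed, :].reshape(len(test), -1)
recon = decode((z_obs @ W).reshape(len(test), -1, 2), bases)
rel_err = np.linalg.norm(recon - test) / np.linalg.norm(test)
print(f"relative reconstruction error on held-out snapshots: {rel_err:.2e}")
```

Because the toy field has only two underlying modes, two observed patches carry enough information for an essentially exact linear reconstruction; real wake data would need more modes, noise handling, and the attention structure described in the paper.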

Title: Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Authors: Andrew Szot, Michael Kirchhof, Omar Attia, Alexander Toshev
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02045
Pdf URL: https://arxiv.org/pdf/2603.02045
Copy Paste: [[2603.02045]] Expanding LLM Agent Boundaries with Strategy-Guided Exploration(https://arxiv.org/abs/2603.02045)
Keywords: generation
Abstract: Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
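Abstract (translated from Chinese): Reinforcement learning (RL) has achieved notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially since they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason about the environment in language, shifting exploration from low-level actions to higher-level language strategies. We therefore propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy describing what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, together with a strategy reflection process that grounds strategy generation in the outcomes of earlier strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks that are too difficult for the base model.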
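
The mixed-temperature sampling idea mentioned in this abstract can be illustrated with a toy sketch. It makes strong simplifying assumptions: strategies are reduced to a small categorical distribution with hypothetical logits, whereas the paper samples natural-language strategies from an LLM, and the temperature values are arbitrary. It shows only that drawing in parallel at several temperatures mixes near-greedy and more exploratory choices.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Temperature-scaled softmax; lower temperature sharpens the distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                      # numerical stability
    p = np.exp(scaled)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mixed_temperature_sample(logits, temperatures, rng):
    # One draw per temperature, standing in for parallel rollouts.
    return [int(rng.choice(len(logits), p=softmax_with_temperature(logits, t)))
            for t in temperatures]

strategy_logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical scores for 4 strategies
temperatures = [0.3, 1.0, 2.0]                  # hypothetical temperature mix
rng = np.random.default_rng(0)

picks = mixed_temperature_sample(strategy_logits, temperatures, rng)
entropies = [entropy(softmax_with_temperature(strategy_logits, t)) for t in temperatures]
print("sampled strategy indices:", picks)
print("entropy per temperature:", [round(h, 3) for h in entropies])
```

The entropy of the sampling distribution grows with temperature, so the high-temperature draws are the ones most likely to surface strategies that the low-temperature draws would rarely pick.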

Title: NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis

Authors: Manuel Serna-Aguilera, Raegan Anderes, Page Dobbs, Khoa Luu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02047
Pdf URL: https://arxiv.org/pdf/2603.02047
Copy Paste: [[2603.02047]] NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis(https://arxiv.org/abs/2603.02047)
Keywords: generation
Abstract: The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product development, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models, as well as the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments show that without needing to process additional tokens from images for over 100 questions, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
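Abstract (translated from Chinese): The nicotine addiction public health crisis remains pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to attract new and young customers. These innovations and product developments, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. We therefore introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset, which provides public health researchers with over 200,000 multimodal samples, including images and text descriptions, covering 55 tobacco and nicotine product brands. In addition, to give public health researchers factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models or the added cost of processing image tokens on large-scale datasets such as NICO. At construction time, NICO-RAG organizes entities and relations extracted from images and text into hypergraphs in order to produce responses that are as factual as possible. This joint multimodal knowledge representation lets NICO-RAG retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments show that, without processing additional image tokens for over 100 questions, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.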

Title: WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Authors: Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02049
Pdf URL: https://arxiv.org/pdf/2603.02049
Copy Paste: [[2603.02049]] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories(https://arxiv.org/abs/2603.02049)
Keywords: generation
Abstract: Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
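Abstract (translated from Chinese): Foundational Video Diffusion Models (VDMs) have recently made significant progress. Yet despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, because camera controllability is limited and the generated content is inconsistent when viewed from different camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction through two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. In addition, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence so that it focuses on fine-grained details from the memory bank. These components allow WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible, control-branch-based WorldStereo is impressively efficient, benefiting from a distribution-matching-distilled VDM backbone without joint training. Extensive experiments on both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, handling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.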

Title: ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks

Authors: Joël Küchler, Ellen van Maren, Vaiva Vasiliauskaitė, Katarina Vulić, Reza Abbasi-Asl, Stephan J. Ihle
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02063
Pdf URL: https://arxiv.org/pdf/2603.02063
Copy Paste: [[2603.02063]] ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks(https://arxiv.org/abs/2603.02063)
Keywords: generation, generative
Abstract: Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
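Abstract (translated from Chinese): Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach to object-centric representation learning that is instead based on cycle-consistent Generative Adversarial Networks. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while being the only approach tested here that can handle more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent-space representations that allow for object manipulation. Finally, we show that ORGAN scales well with both the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.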

Title: LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation

Authors: Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02129
Pdf URL: https://arxiv.org/pdf/2603.02129
Copy Paste: [[2603.02129]] LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation(https://arxiv.org/abs/2603.02129)
Keywords: generative
Abstract: We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
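Abstract (translated from Chinese): We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable, large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on one or more reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently improves the animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.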

Title: SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

Authors: Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02133
Pdf URL: https://arxiv.org/pdf/2603.02133
Copy Paste: [[2603.02133]] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos(https://arxiv.org/abs/2603.02133)
Keywords: generation
Abstract: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which makes it natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline for cluttered scene reconstruction: it first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of the generated assets and physical implausibility of the final scene, a problem that is particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. Specifically, for the transition from Perception to Generation, which is critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, which is essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.

Title: OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Authors: Yiying Yang, Wei Cheng, Sijin Chen, Honghao Fu, Xianfang Zeng, Yujun Cai, Gang Yu, Xingjun Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02138
Pdf URL: https://arxiv.org/pdf/2603.02138
Copy Paste: [[2603.02138]] OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens(https://arxiv.org/abs/2603.02138)
Keywords: generation
Abstract: OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible control over motion and visual content, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow multi-modal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi-modal human instructions.
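The tokenizer idea in the OmniLottie abstract (turning Lottie JSON into a sequence of commands and parameters while discarding invariant metadata) can be illustrated with a toy sketch. This is not the authors' tokenizer: the `SKIP_KEYS` set, the `<key>` token format, and the `tokenize` function are hypothetical choices made only for illustration. The short key names in the sample (`v`, `fr`, `w`, `nm`, `ty`, `ks`, `o`, `k`) follow common Lottie usage.

```python
from typing import Any, Iterator

# Hypothetical set of keys treated as invariant metadata and dropped.
# The real OmniLottie tokenizer may filter very differently.
SKIP_KEYS = {"v", "nm", "ddd"}


def tokenize(node: Any) -> Iterator[str]:
    """Flatten a nested JSON value into command tokens (<key>) and parameter tokens."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in SKIP_KEYS:
                continue  # drop metadata that carries no animation content
            yield f"<{key}>"  # command token naming the field
            yield from tokenize(value)
    elif isinstance(node, list):
        for item in node:
            yield from tokenize(item)
    else:
        yield str(node)  # leaf value becomes a parameter token


if __name__ == "__main__":
    # Tiny hand-written document, not a complete or valid Lottie file.
    doc = {
        "v": "5.7.0",
        "fr": 30,
        "w": 512,
        "layers": [{"ty": 4, "nm": "circle", "ks": {"o": {"k": 100}}}],
    }
    print(list(tokenize(doc)))
```

The sketch only shows why a structured token stream can be shorter than raw JSON: braces, quotes, and skipped metadata never reach the model's input.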

Title: GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis

Authors: Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02172
Pdf URL: https://arxiv.org/pdf/2603.02172
Copy Paste: [[2603.02172]] GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis(https://arxiv.org/abs/2603.02172)
Keywords: generation, generative
Abstract: We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.

Title: Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02175
Pdf URL: https://arxiv.org/pdf/2603.02175
Copy Paste: [[2603.02175]] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance(https://arxiv.org/abs/2603.02175)
Keywords: generation, generative
Abstract: Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that combines learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state of the art in controllable video editing. All datasets, models, and code are released at this https URL.

Title: Multi-Head Low-Rank Attention

Authors: Songtao Liu, Hongwu Peng, Zhiwei Zhang, Zhengyu Chen, Yue Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02188
Pdf URL: https://arxiv.org/pdf/2603.02188
Copy Paste: [[2603.02188]] Multi-Head Low-Rank Attention(https://arxiv.org/abs/2603.02188)
Keywords: generation
Abstract: Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at each step. While Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP). Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, consuming excessive memory traffic and diminishing TP benefits like weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8× decoding speedup over MLA. Code is available at this https URL. Pretrained weights, along with the training and evaluation data, are available at this https URL.
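The sharding bottleneck described in the Multi-Head Low-Rank Attention abstract comes down to simple accounting: a latent cache that cannot be partitioned is loaded in full by every tensor-parallel device, whereas a partitionable one is split across devices. The sketch below shows only that arithmetic. The function name, the 1 GiB cache size, and the assumption of a perfectly even split are illustrative and are not taken from the paper. It does not model MLRA itself or reproduce the reported speedup.

```python
def per_device_kv_load(total_kv_bytes: int, tp_degree: int, partitionable: bool) -> int:
    """Bytes of KV cache one device must load per decoding step (idealized)."""
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    if not partitionable:
        # A single unpartitionable latent head: every device loads the whole cache.
        return total_kv_bytes
    # Partitionable latent states: assume an even split, rounding up.
    return -(-total_kv_bytes // tp_degree)


if __name__ == "__main__":
    total = 1 << 30  # hypothetical 1 GiB KV cache
    for label, partitionable in (("unpartitioned", False), ("partitioned", True)):
        load = per_device_kv_load(total, tp_degree=4, partitionable=partitionable)
        print(f"{label}: {load} bytes per device")
```

Under these assumptions, 4-way tensor parallelism cuts the per-device load to a quarter only when the latent state can be partitioned.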