2026-03-24

Title: Remote Sensing Image Dehazing: A Systematic Review of Progress, Challenges, and Prospects

Authors: Heng Zhou, Xiaoxiong Liu, Zhenxi Zhang, Jieheng Yun, Chengyang Li, Yunchu Yang, Dongyi Xia, Chunna Tian, Xiao-Jun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20289
Pdf URL: https://arxiv.org/pdf/2603.20289
Copy Paste: [[2603.20289]] Remote Sensing Image Dehazing: A Systematic Review of Progress, Challenges, and Prospects(https://arxiv.org/abs/2603.20289)
Keywords: restoration, generation
Abstract: Remote sensing images (RSIs) are frequently degraded by haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. This study presents the first systematic and unified survey of RSI dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. We categorize existing approaches into a three-stage progression, from handcrafted physical priors, to data-driven deep restoration, and finally to hybrid physical-intelligent generation, and summarize more than 30 representative methods across CNNs, GANs, Transformers, and diffusion models. To provide a reliable empirical reference, we conduct large-scale quantitative experiments on five public datasets using 12 metrics, including PSNR, SSIM, CIEDE, LPIPS, FID, SAM, ERGAS, UIQI, QNR, NIQE, and HIST. Cross-domain comparison reveals that recent Transformer- and diffusion-based models improve SSIM by 12% to 18% and reduce perceptual errors by 20% to 35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further demonstrates that models with explicit transmission or airlight constraints reduce color bias by up to 27%. Based on these findings, we summarize open challenges (dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations) and outline promising research directions for the future development of trustworthy, controllable, and efficient (TCE) dehazing systems. All reviewed resources, including source code, benchmark datasets, evaluation metrics, and reproduction configurations, are publicly available at this https URL.
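Illustration (not taken from the paper): the "transmission" and "airlight" terms in the abstract usually refer to the standard atmospheric scattering model, I = J * t + A * (1 - t), where J is the clear scene, t the transmission, and A the airlight. The sketch below applies that model to synthetic data, inverts it with the true t and A, and scores the result with PSNR. The image, the constant transmission of 0.6, the airlight values, and the function names are all assumptions made for the example; real dehazing methods must estimate t and A rather than being given them.

```python
import numpy as np

def add_haze(clear, transmission, airlight):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return clear * transmission + airlight * (1.0 - transmission)

def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the scattering model; t is floored to avoid dividing by ~0."""
    t = np.maximum(transmission, t_min)
    return np.clip((hazy - airlight) / t + airlight, 0.0, 1.0)

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

# Synthetic stand-ins (assumptions): random "scene", uniform haze.
rng = np.random.default_rng(0)
clear = rng.uniform(0.0, 1.0, size=(32, 32, 3))
transmission = np.full((32, 32, 1), 0.6)
airlight = np.array([0.9, 0.9, 0.95])

hazy = add_haze(clear, transmission, airlight)
restored = remove_haze(hazy, transmission, airlight)

print("PSNR hazy vs clear:    ", psnr(clear, hazy))
print("PSNR restored vs clear:", psnr(clear, restored))
```

With the true transmission and airlight the inversion is exact up to floating-point error, so this sketch only shows what the physical constraint encodes, not how well any reviewed method performs.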

Title: Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly

Authors: Qihao Lin, Borui Chen, Yuping Zhou, Jianing Wu, Yulan Guo, Weishi Zheng, Chongkun Xia
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2603.20290
Pdf URL: https://arxiv.org/pdf/2603.20290
Copy Paste: [[2603.20290]] Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly(https://arxiv.org/abs/2603.20290)
Keywords: restoration, generation
Abstract: Contour estimation of transparent fragments is essential for autonomous reassembly, especially in precision optical instrument repair, cultural relic restoration, and the identification of breakage incidents involving other valuable devices. Unlike general intact transparent objects, transparent fragments pose greater challenges for contour estimation because of their strict optical properties and irregular shapes and edges. To address this issue, this paper proposes a general transparent fragment contour estimation framework based on visual-tactile fusion. First, we construct a transparent fragment dataset named TransFrag27K, which includes multi-scene synthetic data of broken fragments from multiple types of transparent objects, together with a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate, and segment the sampling grasping position. We then use a two-finger gripper with Gelsight Mini sensors to obtain reconstructed tactile information about the lateral edges of the fragments. By fusing this tactile information with visual cues, we build a visual-tactile fusion material classifier. Inspired by the way humans estimate a fragment's contour by combining vision and touch, the resulting contour estimation framework demonstrates strong performance in real-world validation. Finally, we propose a contour matching and reassembly algorithm based on multi-dimensional similarity metrics, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. The experimental results demonstrate the validity of the proposed framework. The dataset and codes are available at this https URL.

Title: MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery

Authors: Dong Li, Zhengzhang Chen, Xujiang Zhao, Linlin Yu, Zhong Chen, Yi He, Haifeng Chen, Chen Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20295
Pdf URL: https://arxiv.org/pdf/2603.20295
Copy Paste: [[2603.20295]] MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery(https://arxiv.org/abs/2603.20295)
Keywords: generation
Abstract: Uncovering causal structures from observational data is crucial for understanding complex systems and making informed decisions. While reinforcement learning (RL) has shown promise in identifying these structures in the form of a directed acyclic graph (DAG), existing methods often lack efficiency, making them unsuitable for online applications. In this paper, we propose MARLIN, an efficient multi-agent RL-based approach for incremental DAG learning. MARLIN uses a DAG generation policy that maps a continuous real-valued space to the DAG space as an intra-batch strategy, then incorporates two RL agents, one state-specific and one state-invariant, to uncover causal relationships, and integrates these agents into an incremental learning framework. Furthermore, the framework leverages a factored action space to enhance parallelization efficiency. Extensive experiments on synthetic and real datasets demonstrate that MARLIN outperforms state-of-the-art methods in terms of both efficiency and effectiveness.

Title: InjectFlow: Weak Guides Strong via Orthogonal Injection for Flow Matching

Authors: Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20303
Pdf URL: https://arxiv.org/pdf/2603.20303
Copy Paste: [[2603.20303]] InjectFlow: Weak Guides Strong via Orthogonal Injection for Flow Matching(https://arxiv.org/abs/2603.20303)
Keywords: generation, generative
Abstract: Flow Matching (FM) has recently emerged as a leading approach for high-fidelity visual generation, offering a robust continuous-time alternative to ordinary differential equation (ODE) based models. However, despite their success, FM models are highly sensitive to dataset biases, which cause severe semantic degradation when generating out-of-distribution or minority-class samples. In this paper, we provide a rigorous mathematical formalization of the "Bias Manifold" within the FM framework. We identify that this performance drop is driven by conditional expectation smoothing, a mechanism that inevitably leads to trajectory lock-in during inference. To resolve this, we introduce InjectFlow, a novel, training-free method that injects orthogonal semantics during the initial velocity field computation, without requiring any changes to the random seeds. This design effectively prevents the latent drift toward majority modes while maintaining high generative quality. Extensive experiments demonstrate the effectiveness of our approach. Notably, on the GenEval dataset, InjectFlow successfully fixes 75% of the prompts that standard flow matching models fail to generate correctly. Ultimately, our theoretical analysis and algorithm provide a ready-to-use solution for building fairer and more robust visual foundation models.

Title: Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Thuc D. Nguyen, Nhien-An Le-Khac
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20304
Pdf URL: https://arxiv.org/pdf/2603.20304
Copy Paste: [[2603.20304]] Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges(https://arxiv.org/abs/2603.20304)
Keywords: generation
Abstract: As diffusion models (DMs) enable photorealistic image generation at unprecedented scale, watermarking techniques have become essential for provenance establishment and accountability. Existing methods face challenges: sampling-based approaches operate on frozen models but require costly $N$-step Denoising Diffusion Implicit Models (DDIM) inversion (typically $N=50$) for zero-bit-only detection; fine-tuning-based methods achieve fast multi-bit extraction but couple the watermark to a specific model checkpoint, requiring retraining for each architecture. We propose DiffMark, a plug-and-play watermarking method that offers three key advantages over existing approaches: single-pass multi-bit detection, per-image key flexibility, and cross-model transferability. Rather than encoding the watermark into the initial noise vector, DiffMark injects a persistent learned perturbation $\delta$ at every denoising step of a completely frozen DM. The watermark signal accumulates in the final denoised latent $z_0$ and is recovered in a single forward pass. The central challenge of backpropagating gradients through a frozen UNet without traversing the full denoising chain is addressed by employing Latent Consistency Models (LCM) as a differentiable training bridge. This reduces the number of gradient steps from 50 DDIM steps to 4 LCM steps and enables single-pass detection at 16.4 ms, a 45x speedup over sampling-based methods. Moreover, with this design, the encoder learns to map any runtime secret to a unique perturbation at inference time, providing genuine per-image key flexibility and transferability to unseen diffusion-based architectures without per-model fine-tuning. While achieving these advantages, DiffMark maintains competitive watermark robustness against distortion, regeneration, and adversarial attacks.

Title: EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

Authors: Yuzhe Weng, Haotian Wang, Yuanhong Yu, Jun Du, Shan He, Xiaoyan Wu, Haoran Xu
Subjects: cs.CV, cs.AI, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.20307
Pdf URL: https://arxiv.org/pdf/2603.20307
Copy Paste: [[2603.20307]] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control(https://arxiv.org/abs/2603.20307)
Keywords: generation
Abstract: Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and incurring inherent latency because the whole window is denoised at once. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. To inherently support variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.

Title: Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

Authors: Liu hung ming
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.20327
Pdf URL: https://arxiv.org/pdf/2603.20327
Copy Paste: [[2603.20327]] Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations(https://arxiv.org/abs/2603.20327)
Keywords: generative
Abstract: Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V-JEPA 2 pre-trained representations -- not to the probe. We evaluate through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^{-4}; MI 0.036--0.117 bits, NMI 1.2--3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
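Illustration (not taken from the paper): the statistics quoted above can be computed from a category-by-symbol count table. The sketch below assumes an 8-symbol codebook (inferred from the "3-bit maximum"), assumes NMI is MI divided by that 3-bit maximum (which matches the quoted arithmetic: 0.036 / 3 = 1.2% and 0.117 / 3 = 3.9%), and measures everything in bits. The counts are invented for the example and chosen so that 5 of 8 symbols are active, i.e. an active ratio of 62.5%.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a 1-D probability vector (zeros ignored)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information_bits(counts):
    """MI between category (rows) and symbol (columns) from a count table."""
    joint = counts / counts.sum()
    p_cat = joint.sum(axis=1)
    p_sym = joint.sum(axis=0)
    return entropy_bits(p_cat) + entropy_bits(p_sym) - entropy_bits(joint.ravel())

def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    m = 0.5 * (p + q)
    return entropy_bits(m) - 0.5 * (entropy_bits(p) + entropy_bits(q))

# Hypothetical counts: 2 action categories x 8 codebook symbols.
counts = np.array([[40, 30, 15, 10, 5, 0, 0, 0],
                   [20, 25, 25, 15, 15, 0, 0, 0]], dtype=float)

mi = mutual_information_bits(counts)
max_bits = np.log2(counts.shape[1])             # 3 bits for 8 symbols
nmi_percent = 100.0 * mi / max_bits
active_ratio = (counts.sum(axis=0) > 0).mean()  # fraction of symbols used
p, q = (counts[0] / counts[0].sum(), counts[1] / counts[1].sum())

print(f"MI = {mi:.3f} bits, NMI = {nmi_percent:.1f}% of {max_bits:.0f} bits")
print(f"JSD = {jsd_bits(p, q):.3f} bits, active ratio = {active_ratio:.3f}")
```

Because the two hypothetical categories have equal sample counts, MI and JSD coincide here; with unbalanced categories they would differ.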

Title: Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier

Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Bingjie Gao, Li Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20382
Pdf URL: https://arxiv.org/pdf/2603.20382
Copy Paste: [[2603.20382]] Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier(https://arxiv.org/abs/2603.20382)
Keywords: generation, generative
Abstract: In practical AI workflows, complex tasks often involve chaining multiple generative models, such as using a video or 3D generation model after a 2D image generator. However, distributional mismatches between the output of upstream models and the expected input of downstream models frequently degrade overall generation quality. To address this issue, we propose Uni-Classifier (Uni-C), a simple yet effective plug-and-play module that leverages video diffusion priors to guide the denoising process of preceding models, thereby aligning their outputs with downstream requirements. Uni-C can also be applied independently to enhance the output quality of individual generative models. Extensive experiments across video and 3D generation tasks demonstrate that Uni-C consistently improves generation quality in both workflow-based and standalone settings, highlighting its versatility and strong generalization capability.

Title: SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning

Authors: Y. Sungtaek Ju
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.20392
Pdf URL: https://arxiv.org/pdf/2603.20392
Copy Paste: [[2603.20392]] SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning(https://arxiv.org/abs/2603.20392)
Keywords: generation, generative
Abstract: Probabilistic circuit (PC) structure learning is hampered by greedy algorithms that make irreversible, locally optimal decisions. We propose SymCircuit, which replaces greedy search with a learned generative policy trained via entropy-regularized reinforcement learning. Instantiating the RL-as-inference framework in the PC domain, we show the optimal policy is a tempered Bayesian posterior, recovering the exact posterior when the regularization temperature is set inversely proportional to the dataset size. The policy is implemented as SymFormer, a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits at every generation step. We introduce option-level REINFORCE, restricting gradient updates to structural decisions rather than all tokens, yielding an SNR (signal to noise ratio) improvement and >10 times sample efficiency gain on the NLTCS dataset. A three-layer uncertainty decomposition (structural via model averaging, parametric via the delta method, leaf via conjugate Dirichlet-Categorical propagation) is grounded in the multilinear polynomial structure of PC outputs. On NLTCS, SymCircuit closes 93% of the gap to LearnSPN; preliminary results on Plants (69 variables) suggest scalability.

Title: KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Authors: Yichun Xu, Navjot K. Khaira, Tejinder Singh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20397
Pdf URL: https://arxiv.org/pdf/2603.20397
Copy Paste: [[2603.20397]] KV Cache Optimization Strategies for Scalable and Efficient LLM Inference(https://arxiv.org/abs/2603.20397)
Keywords: generation
Abstract: The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint scales linearly with context length, imposing critical bottlenecks on GPU memory capacity, memory bandwidth, and inference throughput as production LLMs push context windows from thousands to millions of tokens. Efficient KV cache management has thus become a first-order challenge for scalable LLM deployment. This paper provides a systematic review of recent KV cache optimization techniques, organizing them into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. For each category we analyze the underlying mechanisms, deployment trade-offs, and empirical performance across memory reduction, throughput, and model accuracy metrics. We further map techniques to seven practical deployment scenarios, including long-context single requests, high-throughput datacenter serving, edge devices, multi-turn conversations, and accuracy-critical reasoning, providing actionable guidance for practitioners selecting among competing approaches. Our analysis reveals that no single technique dominates across all settings; instead, the optimal strategy depends on context length, hardware constraints, and workload characteristics, pointing toward adaptive, multi-stage optimization pipelines as a promising direction for future research.
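Illustration (not taken from the paper): the linear growth with context length follows from the cache holding one key and one value vector per layer, per KV head, per token. The sketch below computes that footprint for a hypothetical model shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per value); these numbers and the function name are assumptions chosen for the example, not a configuration discussed in the survey.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4_096, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:8.1f} GiB")
```

Under these assumptions each token costs 0.5 MiB, so a 4,096-token context needs 2 GiB while a context of about one million tokens needs 512 GiB for a single request, which is why eviction, compression, and attention variants that shrink the number of KV heads are attractive.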

Title: Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation

Authors: Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20406
Pdf URL: https://arxiv.org/pdf/2603.20406
Copy Paste: [[2603.20406]] Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation(https://arxiv.org/abs/2603.20406)
Keywords: generation
Abstract: We investigate whether independently trained language models converge to geometrically compatible latent representations, and whether this compatibility can be exploited to correct model behavior at inference time without any weight updates. We learn a linear projection matrix that maps activation vectors from a large teacher model into the coordinate system of a smaller student model, then intervene on the student's residual stream during generation by substituting its internal state with the translated teacher representation. Across a fully crossed experimental matrix of 20 heterogeneous teacher-student pairings spanning mixture-of-experts, dense, code-specialized, and synthetically trained architectures, the Ridge projection consistently achieves R^2 = 0.50 on verbal reasoning and R^2 = 0.40 on mathematical reasoning, collapsing to R^2 = -0.22 under permutation control and R^2 = 0.01 under L_1 regularization. Behavioral correction rates range from 14.0% to 50.0% on TruthfulQA (mean 25.2%) and from 8.5% to 43.3% on GSM8K arithmetic reasoning (mean 25.5%), demonstrating that the method generalizes across fundamentally different reasoning domains. We report a near-zero correlation between geometric alignment quality and behavioral correction rate (r = -0.07), revealing a dissociation between representation space fidelity and output space impact. Intervention strength is architecture-specific: student models exhibit characteristic sensitivity profiles that invert across domains, with the most steerable verbal student becoming the least steerable mathematical student. Finally, a double dissociation experiment conducted across all 20 model pairings confirms without exception that projection matrices collapse catastrophically when transferred across reasoning domains (mean R^2 = -3.83 in both transfer directions), establishing domain-specific subspace geometry as a universal property of LMs.
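Illustration (not taken from the paper): a ridge-regularized linear projection from teacher activations to student activations can be fitted in closed form and scored with R^2 on held-out rows, alongside a permutation control that destroys the row correspondence. The data below is synthetic, and the dimensions, noise level, regularization strength, and function names are assumptions for the example; the paper's actual layers, models, and hyperparameters are not reproduced.

```python
import numpy as np

def fit_ridge_projection(teacher, student, alpha=1.0):
    """Closed-form W minimizing ||teacher @ W - student||^2 + alpha * ||W||^2."""
    d = teacher.shape[1]
    gram = teacher.T @ teacher + alpha * np.eye(d)
    return np.linalg.solve(gram, teacher.T @ student)

def r_squared(target, pred):
    """Coefficient of determination pooled over all output dimensions."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for activations (assumption): 500 samples,
# 64-dim "teacher" states linearly related to 32-dim "student" states.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(500, 64))
true_map = rng.normal(size=(64, 32))
student = teacher @ true_map + 0.1 * rng.normal(size=(500, 32))

train, test = slice(0, 400), slice(400, 500)
W = fit_ridge_projection(teacher[train], student[train])
r2_aligned = r_squared(student[test], teacher[test] @ W)

# Permutation control: shuffle teacher rows so pairs no longer correspond.
perm = rng.permutation(400)
W_perm = fit_ridge_projection(teacher[train][perm], student[train])
r2_perm = r_squared(student[test], teacher[test] @ W_perm)

print(f"held-out R^2 (aligned):  {r2_aligned:.3f}")
print(f"held-out R^2 (permuted): {r2_perm:.3f}")
```

A negative held-out R^2, as reported for the permutation and cross-domain controls in the abstract, means the projection predicts worse than simply outputting the mean student activation.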

Title: PEARL: Personalized Streaming Video Understanding Model

Authors: Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.20422
Pdf URL: https://arxiv.org/pdf/2603.20422
Copy Paste: [[2603.20422]] PEARL: Personalized Streaming Video Understanding Model(https://arxiv.org/abs/2603.20422)
Keywords: generation
Abstract: Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at this https URL.

Title: Understanding Behavior Cloning with Action Quantization

Authors: Haoqun Cao, Tengyang Xie
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.20538
Pdf URL: https://arxiv.org/pdf/2603.20538
Copy Paste: [[2603.20538]] Understanding Behavior Cloning with Action Quantization(https://arxiv.org/abs/2603.20538)
Keywords: generative
Abstract: Behavior cloning is a fundamental paradigm in machine learning, enabling policy learning from expert demonstrations across robotics, autonomous driving, and generative models. Autoregressive models such as Transformers have proven remarkably effective, from large language models (LLMs) to vision-language-action systems (VLAs). However, applying autoregressive models to continuous control requires discretizing actions through quantization, a practice that is widely adopted yet poorly understood theoretically. This paper provides theoretical foundations for this practice. We analyze how quantization error propagates along the horizon and interacts with statistical sample complexity. We show that behavior cloning with quantized actions and log-loss achieves optimal sample complexity, matching existing lower bounds, and incurs only polynomial horizon dependence on quantization error, provided the dynamics are stable and the policy satisfies a probabilistic smoothness condition. We further characterize when different quantization schemes satisfy or violate these requirements, and propose a model-based augmentation that provably improves the error bound without requiring policy smoothness. Finally, we establish fundamental limits that jointly capture the effects of quantization error and statistical complexity.
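Illustration (not taken from the paper): the simplest way to discretize a continuous action is uniform binning, where each action becomes a bin index (a token) and is decoded back to the bin center. The sketch below uses that scheme as an assumption; the paper analyzes quantization schemes in general, and the action range, bin counts, and function names here are chosen only for the example. The per-step reconstruction error of uniform binning is at most half a bin width.

```python
import numpy as np

def quantize(actions, low, high, num_bins):
    """Map continuous actions in [low, high] to integer bin indices."""
    width = (high - low) / num_bins
    idx = np.floor((actions - low) / width).astype(int)
    return np.clip(idx, 0, num_bins - 1)  # keeps action == high in range

def dequantize(idx, low, high, num_bins):
    """Decode a bin index to the center of its bin."""
    width = (high - low) / num_bins
    return low + (idx + 0.5) * width

rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=10_000)

for num_bins in (8, 64, 256):
    tokens = quantize(actions, -1.0, 1.0, num_bins)
    recon = dequantize(tokens, -1.0, 1.0, num_bins)
    max_err = np.abs(actions - recon).max()
    half_width = (1.0 - (-1.0)) / num_bins / 2
    print(f"{num_bins:>3} bins: max error {max_err:.5f} (bound {half_width:.5f})")
```

This bounds only the single-step error; how such errors compound over a long horizon is the question the abstract says the paper addresses.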

Title: Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems

Authors: Alankrita Bhatt, Mukur Gupta, Germain Kolossov, Andrea Montanari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.20589
Pdf URL: https://arxiv.org/pdf/2603.20589
Copy Paste: [[2603.20589]] Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems(https://arxiv.org/abs/2603.20589)
Keywords: generative
Abstract: Generating data from discrete distributions is important for a number of application domains including text, tabular data, and genomic data. Several groups have recently used random $k$-satisfiability ($k$-SAT) as a synthetic benchmark for new generative techniques. In this paper, we show that fundamental insights from the theory of random constraint satisfaction problems have observable implications (sometimes contradicting intuition) on the behavior of generative techniques on such benchmarks. More precisely, we study the problem of generating a uniformly random solution of a given (random) $k$-SAT or $k$-XORSAT formula. Among other findings, we observe that: (i) Continuous diffusions outperform masked discrete diffusions; (ii) Learned diffusions can match the theoretical 'ideal' accuracy; (iii) Smart ordering of the variables can significantly improve accuracy, although not following popular heuristics.
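
To make the benchmark concrete, here is a small sketch of how a random $k$-SAT instance can be drawn and how generated assignments can be checked. The signed-integer literal encoding and the helper names are assumptions, and scoring a sampler by the fraction of valid samples is only one plausible reading of "accuracy", not necessarily the paper's metric.

```python
import random


def random_ksat(num_vars, num_clauses, k, rng):
    """Draw clauses of k distinct variables, each negated with probability 1/2.

    A literal is a signed integer: +v means x_v, -v means NOT x_v (v is 1-based).
    """
    formula = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), k)
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula


def satisfies(formula, assignment):
    """assignment[v] is the boolean value of x_v; index 0 is unused."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def fraction_valid(formula, samples):
    """Share of sampled assignments that satisfy every clause."""
    return sum(satisfies(formula, s) for s in samples) / len(samples)
```

For example, `random_ksat(20, 60, 3, random.Random(0))` draws a 3-SAT formula with 20 variables and 60 clauses, and `fraction_valid` can then score a batch of assignments produced by any generative model.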

Title: ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework

Authors: Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, Hongjie Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20644
Pdf URL: https://arxiv.org/pdf/2603.20644
Copy Paste: [[2603.20644]] ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework(https://arxiv.org/abs/2603.20644)
Keywords: generation
Abstract: Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed-source models for annotation, which prevents cost-effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of large-scale, high-quality image editing datasets. Our pipeline consists of three key components: source image expansion with world-knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine-tuning UniWorld-V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS-Bench for knowledge-infused benchmarks. These results demonstrate that open-source, agentic pipelines can approach commercial-grade data quality while retaining cost-effectiveness and scalability. Both the framework and dataset will be open-sourced.

Title: Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity

Authors: Zixuan Zhang, Kaixuan Huang, Tuo Zhao, Mengdi Wang, Minshuo Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.20645
Pdf URL: https://arxiv.org/pdf/2603.20645
Copy Paste: [[2603.20645]] Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity(https://arxiv.org/abs/2603.20645)
Keywords: generative
Abstract: Diffusion models have become a leading framework in generative modeling, yet their theoretical understanding -- especially for high-dimensional data concentrated on low-dimensional structures -- remains incomplete. This paper investigates how diffusion models learn such structured data, focusing on two key aspects: statistical complexity and influence of data geometric properties. By modeling data as samples from a smooth Riemannian manifold, our analysis reveals crucial decompositions of score functions in diffusion models under different levels of injected noise. We also highlight the interplay of manifold curvature with the structures in the score function. These analyses enable an efficient neural network approximation to the score function, built upon which we further provide statistical rates for score estimation and distribution learning. Remarkably, the obtained statistical rates are governed by the intrinsic dimension of data and the manifold curvature. These results advance the statistical foundations of diffusion models, bridging theory and practice for generative modeling on manifolds.

Title: Exponential Family Discriminant Analysis: Generalizing LDA-Style Generative Classification to Non-Gaussian Models

Authors: Anish Lakkapragada
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.20655
Pdf URL: https://arxiv.org/pdf/2603.20655
Copy Paste: [[2603.20655]] Exponential Family Discriminant Analysis: Generalizing LDA-Style Generative Classification to Non-Gaussian Models(https://arxiv.org/abs/2603.20655)
Keywords: generative
Abstract: We introduce Exponential Family Discriminant Analysis (EFDA), a unified generative framework that extends classical Linear Discriminant Analysis (LDA) beyond the Gaussian setting to any member of the exponential family. Under the assumption that each class-conditional density belongs to a common exponential family, EFDA derives closed-form maximum-likelihood estimators for all natural parameters and yields a decision rule that is linear in the sufficient statistic, recovering LDA as a special case and capturing nonlinear decision boundaries in the original feature space. We prove that EFDA is asymptotically calibrated and statistically efficient under correct specification, and we generalise it to $K \geq 2$ classes and multivariate data. Through extensive simulation across five exponential-family distributions (Weibull, Gamma, Exponential, Poisson, Negative Binomial), EFDA matches the classification accuracy of LDA, QDA, and logistic regression while reducing Expected Calibration Error (ECE) by $2$--$6\times$, a gap that is structural: it persists for all $n$ and across all class-imbalance levels, because misspecified models remain asymptotically miscalibrated. We further prove and empirically confirm that EFDA's log-odds estimator approaches the Cramér-Rao bound under correct specification, and is the only estimator in our comparison whose mean squared error converges to zero. Complete derivations are provided for nine distributions. Finally, we formally verify all four theoretical propositions in Lean 4, using Aristotle (Harmonic) and OpenGauss (Math, Inc.) as proof generators, with all outputs independently machine-checked by AXLE (Axiom).
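
The claim that the decision rule is linear in the sufficient statistic can be seen in the simplest case. The sketch below assumes two classes with Poisson class-conditional densities, where the sufficient statistic is $x$ itself and the closed-form MLE of each rate is the class mean; it is an illustration of the general idea, not the paper's code, and the function names are hypothetical.

```python
import math


def fit_poisson_classes(xs, ys):
    """Closed-form MLE per class: rate = class mean, prior = class frequency."""
    params = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        params[label] = (sum(values) / len(values), len(values) / len(xs))
    return params


def log_odds(x, params):
    """log P(y=1 | x) - log P(y=0 | x) for labels 0 and 1.

    The log(x!) term of the Poisson log-density cancels between classes,
    leaving slope * x + intercept. Both fitted rates must be positive.
    """
    (rate0, prior0), (rate1, prior1) = params[0], params[1]
    slope = math.log(rate1) - math.log(rate0)
    intercept = -(rate1 - rate0) + math.log(prior1) - math.log(prior0)
    return slope * x + intercept
```

For families whose sufficient statistic is a nonlinear function of $x$, the same construction gives a rule that is linear in that statistic but nonlinear in $x$, which is the behavior the abstract describes.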

Title: MFSR: MeanFlow Distillation for One Step Real-World Image Super Resolution

Authors: Ruiqing Wang, Kai Zhang, Yuanzhi Zhu, Hanshu Yan, Shilin Lu, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20690
Pdf URL: https://arxiv.org/pdf/2603.20690
Copy Paste: [[2603.20690]] MFSR: MeanFlow Distillation for One Step Real-World Image Super Resolution(https://arxiv.org/abs/2603.20690)
Keywords: restoration, super-resolution, generative
Abstract: Diffusion- and flow-based models have advanced Real-world Image Super-Resolution (Real-ISR), but their multi-step sampling makes inference slow and hard to deploy. One-step distillation alleviates the cost, yet often degrades restoration quality and removes the option to refine with more steps. We present Mean Flows for Super-Resolution (MFSR), a new distillation framework that produces photorealistic results in a single step while still allowing an optional few-step path for further improvement. Our approach uses MeanFlow as the learning target, enabling the student to approximate the average velocity between arbitrary states of the Probability Flow ODE (PF-ODE) and effectively capture the teacher's dynamics without explicit rollouts. To better leverage pretrained generative priors, we additionally improve the original MeanFlow's Classifier-Free Guidance (CFG) formulation with a teacher CFG distillation strategy, which enhances restoration capability and preserves fine details. Experiments on both synthetic and real-world benchmarks demonstrate that MFSR achieves efficient, flexible, and high-quality super-resolution, delivering results on par with or even better than multi-step teachers while requiring much lower computational cost.
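
The idea of an average velocity between two states of an ODE can be checked numerically on a toy problem. In the sketch below the scalar velocity field `dz/dt = -z`, the Euler integrator, and all function names are assumptions for illustration; in MFSR a student network would predict the average velocity directly rather than obtaining it by integration.

```python
import math


def velocity(z, t):
    """Toy instantaneous velocity field (an assumption): dz/dt = -z."""
    return -z


def integrate(z, t_start, t_end, steps=10000):
    """Multi-step Euler integration of dz/dt = velocity(z, t) from t_start to t_end."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        z += dt * velocity(z, t)
        t += dt
    return z


def average_velocity(z_t, r, t):
    """Average velocity over [r, t] along the trajectory through z_t: (z_t - z_r) / (t - r)."""
    z_r = integrate(z_t, t, r)
    return (z_t - z_r) / (t - r)


def one_step(z_t, r, t, u):
    """Jump from time t to time r in a single step using an average velocity u."""
    return z_t - (t - r) * u
```

Starting from `z_t = 1.0` at `t = 1`, a single call to `one_step` with the average velocity lands on the same state at `r = 0` that the 10,000-step integration reaches, which is the property one-step distillation tries to exploit.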

Title: Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models

Authors: Yifan Yang, Lei Zou, Wendy Jepson
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20697
Pdf URL: https://arxiv.org/pdf/2603.20697
Copy Paste: [[2603.20697]] Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models(https://arxiv.org/abs/2603.20697)
Keywords: generation, generative, quality assessment
Abstract: In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism--fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy, 0.71, whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.

Title: High-Quality and Efficient Turbulence Mitigation with Events

Authors: Xiaoran Zhang, Jian Ding, Yuxing Duan, Haoyue Liu, Gang Chen, Yi Chang, Luxin Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20708
Pdf URL: https://arxiv.org/pdf/2603.20708
Copy Paste: [[2603.20708]] High-Quality and Efficient Turbulence Mitigation with Events(https://arxiv.org/abs/2603.20708)
Keywords: restoration
Abstract: Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively. Our code is available at: this https URL.

Title: Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

Authors: Yifei Deng, Chenglong Li, Yuyang Zhang, Guyue Hu, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20721
Pdf URL: https://arxiv.org/pdf/2603.20721
Copy Paste: [[2603.20721]] Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark(https://arxiv.org/abs/2603.20721)
Keywords: generation
Abstract: Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text--image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text--aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text--aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.

Title: Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

Authors: Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang, Yalong Bai, Hongzhi Zhang, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20725
Pdf URL: https://arxiv.org/pdf/2603.20725
Copy Paste: [[2603.20725]] Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation(https://arxiv.org/abs/2603.20725)
Keywords: generation, generative
Abstract: Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.

Title: VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation

Authors: Jun Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20731
Pdf URL: https://arxiv.org/pdf/2603.20731
Copy Paste: [[2603.20731]] VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation(https://arxiv.org/abs/2603.20731)
Keywords: quality assessment
Abstract: Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.

Title: Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping

Authors: Sunghyun Park, Jeongho Kim, Hyoungwoo Park, Debasmit Das, Sungrack Yun, Munawar Hayat, Jaegul Choo, Fatih Porikli, Seokeon Choi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20755
Pdf URL: https://arxiv.org/pdf/2603.20755
Copy Paste: [[2603.20755]] Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping(https://arxiv.org/abs/2603.20755)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward & backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
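
A rough sketch of timestep-dependent patch sampling follows. The abstract only states that patch size depends on the diffusion timestep and that crops are resized to a fixed lower resolution; the linear schedule, the nearest-neighbour resize, the nested-list image representation, and the function names are all assumptions for illustration.

```python
import random


def patch_size_for_timestep(t, num_timesteps, image_size, min_patch):
    """Hypothetical linear schedule: large crops at high (noisy) timesteps, small crops at low ones."""
    frac = t / (num_timesteps - 1)
    return round(min_patch + frac * (image_size - min_patch))


def crop_and_resize(image, patch, out_size, rng):
    """Take a random square crop of side `patch` and resize it to out_size x out_size.

    `image` is a square list of rows; resizing uses nearest-neighbour sampling.
    """
    size = len(image)
    top = rng.randint(0, size - patch)
    left = rng.randint(0, size - patch)
    return [
        [image[top + (i * patch) // out_size][left + (j * patch) // out_size]
         for j in range(out_size)]
        for i in range(out_size)
    ]
```

Because every crop is brought to the same `out_size`, the training tensor shape (and hence activation memory) stays fixed while the field of view changes with the timestep: wide crops expose global structure, narrow crops expose fine detail.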

Title: ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking

Authors: Kanglong Fan, Tianhe Wu, Wen Wen, Jianzhao Liu, Le Yang, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20785
Pdf URL: https://arxiv.org/pdf/2603.20785
Copy Paste: [[2603.20785]] ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking(https://arxiv.org/abs/2603.20785)
Keywords: quality assessment
Abstract: Reasoning-induced vision-language models (VLMs) advance image quality assessment (IQA) with textual reasoning, yet their scalar scores often lack sensitivity and collapse to a few values, so-called discrete collapse. We introduce ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework. It (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries, (ii) reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fuse this ordinal evidence with the initial score under Thurstone's Case V model, and (iii) performs gated reflection and consolidates memory to improve future decisions. This yields denser, distortion-sensitive predictions and mitigates discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
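
Thurstone's Case V model relates a pairwise preference probability to a difference of latent quality scores through the standard normal CDF. The sketch below shows that relation and a naive way to combine it with an initial score; the fusion rule (a fixed-weight average), the clipping constant, and the function names are assumptions and should not be read as ME-IQA's actual procedure.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def preference_probability(score_a, score_b, sigma=1.0):
    """Case V: P(a preferred over b) = Phi((s_a - s_b) / (sigma * sqrt(2)))."""
    return _STD_NORMAL.cdf((score_a - score_b) / (sigma * 2 ** 0.5))


def implied_score(neighbor_score, prob_query_better, sigma=1.0, eps=1e-6):
    """Invert Case V: the query score implied by one neighbour and one preference probability."""
    p = min(max(prob_query_better, eps), 1 - eps)  # keep inv_cdf finite
    return neighbor_score + sigma * 2 ** 0.5 * _STD_NORMAL.inv_cdf(p)


def fuse(initial_score, neighbors, weight=0.5, sigma=1.0):
    """Blend the initial scalar score with the mean score implied by retrieved neighbours.

    `neighbors` is a list of (neighbor_score, prob_query_better) pairs.
    """
    implied = [implied_score(s, p, sigma) for s, p in neighbors]
    return weight * initial_score + (1 - weight) * sum(implied) / len(implied)
```

Because the implied scores vary continuously with the preference probabilities, the fused output can take values between the few discrete levels that a scalar-scoring VLM tends to emit.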

Title: Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Authors: Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20808
Pdf URL: https://arxiv.org/pdf/2603.20808
Copy Paste: [[2603.20808]] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models(https://arxiv.org/abs/2603.20808)
Keywords: generation
Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of the LLM exhibits degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.

Title: TAFG-MAN: Timestep-Adaptive Frequency-Gated Latent Diffusion for Efficient and High-Quality Low-Dose CT Image Denoising

Authors: Tangtangfang Fang, Yang Jiao, Xiangjian He, Jingxi Hu, Jiaqi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20868
Pdf URL: https://arxiv.org/pdf/2603.20868
Copy Paste: [[2603.20868]] TAFG-MAN: Timestep-Adaptive Frequency-Gated Latent Diffusion for Efficient and High-Quality Low-Dose CT Image Denoising(https://arxiv.org/abs/2603.20868)
Keywords: restoration
Abstract: Low-dose computed tomography (LDCT) reduces radiation exposure but also introduces substantial noise and structural degradation, making it difficult to suppress noise without erasing subtle anatomical details. In this paper, we present TAFG-MAN, a latent diffusion framework for efficient and high-quality LDCT image denoising. The framework combines a perceptually optimized autoencoder, conditional latent diffusion restoration in a compact latent space, and a lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning design. TAFG decomposes condition features into low- and high-frequency components, predicts timestep-adaptive gates from the current denoising feature and timestep embedding, and progressively releases high-frequency guidance in later denoising stages before cross-attention. In this way, the model relies more on stable structural guidance at early reverse steps and introduces fine details more cautiously as denoising proceeds, improving the balance between noise suppression and detail preservation. Experiments show that TAFG-MAN achieves a favorable quality-efficiency trade-off against representative baselines. Compared with its base variant without TAFG, it further improves detail preservation and perceptual quality while maintaining essentially the same inference cost, and ablation results confirm the effectiveness of the proposed conditioning mechanism.
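The gating idea described in the TAFG-MAN abstract can be illustrated with a toy 1D sketch. This is not the authors' implementation: the moving-average low-pass filter, the fixed sigmoid schedule in `gate`, and all function names are assumptions made for illustration, whereas the paper predicts its gates from the current denoising feature and the timestep embedding.

```python
import math


def moving_average(x, k=3):
    """Low-pass filter: centered moving average, window clipped at the edges."""
    n = len(x)
    half = k // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def split_frequencies(cond):
    """Split a condition signal into low- and high-frequency parts (low + high == cond)."""
    low = moving_average(cond)
    high = [c - l for c, l in zip(cond, low)]
    return low, high


def gate(t, num_steps, sharpness=8.0):
    """Hypothetical hand-written gate in (0, 1).

    Reverse diffusion runs from t = num_steps down to t = 0, so the gate is
    close to 0 at the first reverse step and close to 1 at the last one.
    """
    progress = 1.0 - t / num_steps
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - 0.5)))


def gated_condition(cond, t, num_steps):
    """Keep the low-frequency guidance, release high-frequency guidance gradually."""
    low, high = split_frequencies(cond)
    g = gate(t, num_steps)
    return [l + g * h for l, h in zip(low, high)]
```

With this schedule, early reverse steps see mostly the smoothed condition, and the full condition is almost restored by the final step.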

Title: Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections

Authors: Zhaoyi Liu, Haichuan Zhang, Ang Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20896
Pdf URL: https://arxiv.org/pdf/2603.20896
Copy Paste: [[2603.20896]] Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections(https://arxiv.org/abs/2603.20896)
Keywords: generation
Abstract: Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.
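The contrast between the two feasible sets in the sHC abstract can be made concrete with a small pure-Python sketch. It makes several assumptions that the abstract does not confirm: the sphere is taken to have unit radius, the "projection" is a plain rescaling by the largest singular value, and the function names are hypothetical. The paper's actual parameterization may differ.

```python
import math


def sinkhorn(mat, iters=50):
    """Push a strictly positive square matrix toward the Birkhoff polytope
    by alternately normalizing rows and columns."""
    m = [row[:] for row in mat]
    n = len(m)
    for _ in range(iters):
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


def spectral_norm(mat, iters=200):
    """Largest singular value, estimated by power iteration on M^T M."""
    n_rows, n_cols = len(mat), len(mat[0])
    v = [1.0] * n_cols
    for _ in range(iters):
        u = [sum(mat[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(mat[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(mat[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))


def rescale_to_unit_spectral_norm(mat):
    """Assumed stand-in for the spectral-sphere constraint: divide by the
    spectral norm so the result has spectral norm 1. Signs are preserved."""
    sigma = spectral_norm(mat)
    return [[v / sigma for v in row] for row in mat]
```

The Sinkhorn output is entrywise non-negative by construction, while the rescaled matrix keeps any negative entries of its input, which is the property the abstract associates with subtractive cross-stream mixing.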

Title: LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models

Authors: Amirmohammad Ziaei Bideh, Jonathan Gryak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.20910
Pdf URL: https://arxiv.org/pdf/2603.20910
Copy Paste: [[2603.20910]] LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models(https://arxiv.org/abs/2603.20910)
Keywords: generative
Abstract: Discovering the governing equations of dynamical systems is a central problem across many scientific disciplines. As experimental data become increasingly available, automated equation discovery methods offer a promising data-driven approach to accelerate scientific discovery. Among these methods, genetic programming (GP) has been widely adopted due to its flexibility and interpretability. However, GP-based approaches often suffer from inefficient exploration of the symbolic search space, leading to slow convergence and suboptimal solutions. To address these limitations, we propose LLM-ODE, a large language model-aided model discovery framework that guides symbolic evolution using patterns extracted from elite candidate equations. By leveraging the generative prior of large language models, LLM-ODE produces more informed search trajectories while preserving the exploratory strengths of evolutionary algorithms. Empirical results on 91 dynamical systems show that LLM-ODE variants consistently outperform classical GP methods in terms of search efficiency and Pareto-front quality. Overall, our results demonstrate that LLM-ODE improves both efficiency and accuracy over traditional GP-based discovery and offers greater scalability to higher-dimensional systems compared to linear and Transformer-only model discovery methods.

Title: Discriminative Representation Learning for Clinical Prediction

Authors: Yang Zhang, Li Fan, Samuel Lawrence, Shi Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.20921
Pdf URL: https://arxiv.org/pdf/2603.20921
Copy Paste: [[2603.20921]] Discriminative Representation Learning for Clinical Prediction(https://arxiv.org/abs/2603.20921)
Keywords: generative
Abstract: Foundation models in healthcare have largely adopted self-supervised pretraining objectives inherited from natural language processing and computer vision, emphasizing reconstruction and large-scale representation learning prior to downstream adaptation. We revisit this paradigm in outcome-centric clinical prediction settings and argue that, when high-quality supervision is available, direct outcome alignment may provide a stronger inductive bias than generative pretraining. We propose a supervised deep learning framework that explicitly shapes representation geometry by maximizing inter-class separation relative to within-class variance, thereby concentrating model capacity along clinically meaningful axes. Across multiple longitudinal electronic health record tasks, including mortality and readmission prediction, our approach consistently outperforms masked, autoregressive, and contrastive pretraining baselines under matched model capacity. The proposed method improves discrimination, calibration, and sample efficiency, while simplifying the training pipeline to a single-stage optimization. These findings suggest that in low-entropy, outcome-driven healthcare domains, supervision can act as the statistically optimal driver of representation learning, challenging the assumption that large-scale self-supervised pretraining is a prerequisite for strong clinical performance.
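The phrase "inter-class separation relative to within-class variance" can be read as a Fisher-style scatter ratio. The sketch below computes one such ratio on fixed feature vectors. It is an assumption about the general form of the criterion, not the paper's loss: the trace-based definition and the names `class_means` and `separation_ratio` are illustrative, and the within-class scatter is assumed to be non-zero.

```python
def class_means(features, labels):
    """Per-class mean vectors and per-class sample counts."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    means = {y: [s / counts[y] for s in sums[y]] for y in sums}
    return means, counts


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def separation_ratio(features, labels):
    """Trace of between-class scatter divided by trace of within-class scatter."""
    means, counts = class_means(features, labels)
    n, dim = len(features), len(features[0])
    global_mean = [sum(x[d] for x in features) / n for d in range(dim)]
    between = sum(counts[y] * sq_dist(means[y], global_mean) for y in means)
    within = sum(sq_dist(x, means[y]) for x, y in zip(features, labels))
    return between / within


feats = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
well_separated = separation_ratio(feats, [0, 0, 1, 1])  # classes split along x
entangled = separation_ratio(feats, [0, 1, 0, 1])       # classes split along y
```

A training objective in this spirit would push the encoder toward representations with a larger ratio; here the same points score much higher when the labels follow the widely spaced axis.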

Title: GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies

Authors: Uzair Shah, Marco Agus, Mahmoud Gamal, Mahmood Alzubaidi, Corrado Cali, Pierre J. Magistretti, Abdesselam Bouzerdoum, Mowafa Househ
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.20970
Pdf URL: https://arxiv.org/pdf/2603.20970
Copy Paste: [[2603.20970]] GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies(https://arxiv.org/abs/2603.20970)
Keywords: generative
Abstract: Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: this https URL
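The symmetric InfoNCE objective named in the GraPHFormer abstract has a standard CLIP-style form, sketched below in pure Python for two small batches of paired embeddings. The temperature value, the function names, and the use of plain lists in place of the image and TreeLSTM encoders are assumptions for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def logsumexp(row):
    """Numerically stable log(sum(exp(row)))."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    n = len(logits)
    return sum(logsumexp(logits[i]) - logits[i][i] for i in range(n)) / n


def symmetric_info_nce(za, zb, temperature=0.1):
    """Average of the a-to-b and b-to-a InfoNCE losses on cosine similarities."""
    a = [l2_normalize(v) for v in za]
    b = [l2_normalize(v) for v in zb]
    n = len(a)
    logits = [
        [sum(p * q for p, q in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Row i of each batch is assumed to describe the same neuron in two views.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
mismatched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

The loss is small when matching rows point in the same direction and large when the pairing is scrambled, which is what pulls the two modalities into a shared embedding space.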

Title: Interpreting the Synchronization Gap: The Hidden Mechanism Inside Diffusion Transformers

Authors: Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen, Viola Zixin Zhao
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.stat-mech
Abstract URL: https://arxiv.org/abs/2603.20987
Pdf URL: https://arxiv.org/pdf/2603.20987
Copy Paste: [[2603.20987]] Interpreting the Synchronization Gap: The Hidden Mechanism Inside Diffusion Transformers(https://arxiv.org/abs/2603.20987)
Keywords: generative
Abstract: Recent theoretical models of diffusion processes, conceptualized as coupled Ornstein-Uhlenbeck systems, predict a hierarchy of interaction timescales, and consequently, the existence of a synchronization gap between modes that commit at different stages of the reverse process. However, because these predictions rely on continuous time and analytically tractable score functions, it remains unclear how this phenomenology manifests in the deep, discrete architectures deployed in practice. In this work, we investigate how the synchronization gap is mechanistically realized within pretrained Diffusion Transformers (DiTs). We construct an explicit architectural realization of replica coupling by embedding two generative trajectories into a joint token sequence, modulated by a symmetric cross-attention gate with variable coupling strength g. Through a linearized analysis of the attention difference, we show that the replica interaction decomposes mechanistically. We empirically validate our theoretical framework on a pretrained DiT-XL/2 model by tracking commitment and per-layer internal mode energies. Our results reveal that: (1) the synchronization gap is an intrinsic architectural property of DiTs that persists even when external coupling is turned off; (2) as predicted by our spatial routing bounds, the gap completely collapses under strong coupling; (3) the gap is strictly depth-localized, emerging sharply only within the final layers of the Transformer; and (4) global, low-frequency structures consistently commit before local, high-frequency details. Ultimately, our findings provide a mechanistic interpretation of how Diffusion Transformers resolve generative ambiguity, isolating speciation transitions to the terminal layers of the network.

Title: LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction

Authors: Shuwei Huang, Shizhuo Liu, Zijun Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21045
Pdf URL: https://arxiv.org/pdf/2603.21045
Copy Paste: [[2603.21045]] LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction(https://arxiv.org/abs/2603.21045)
Keywords: super-resolution
Abstract: Diffusion-based image super-resolution (SR), which aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) observations, faces a fundamental trade-off between inference efficiency and reconstruction quality. The state-of-the-art residual-shifting diffusion framework achieves efficient 4-step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior-enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed-form analytical solution of the optimal intermediate noise for the residual-shifting diffusion paradigm, and accordingly design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework's core efficient residual-shifting mechanism. We further mitigate initial bias with a high-quality pre-upsampling network to optimize the diffusion starting point. With a compact 4-step trajectory, LPNSR can be optimized in an end-to-end manner. Extensive experiments demonstrate that LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets, without relying on any large-scale text-to-image priors. The source code of our method can be found at this https URL.

Title: Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting

Authors: Hwasik Jeong, Seungryong Lee, Gyeongjin Kang, Seungkwon Yang, Xiangyu Sun, Seungtae Nam, Eunbyung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21064
Pdf URL: https://arxiv.org/pdf/2603.21064
Copy Paste: [[2603.21064]] Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting(https://arxiv.org/abs/2603.21064)
Keywords: generation
Abstract: Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Although conceptually simple and largely underexplored in prior work, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.

Title: Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models

Authors: Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21085
Pdf URL: https://arxiv.org/pdf/2603.21085
Copy Paste: [[2603.21085]] Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models(https://arxiv.org/abs/2603.21085)
Keywords: generation
Abstract: Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.

Title: MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics

Authors: Pengxiang Cai, Mengyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21136
Pdf URL: https://arxiv.org/pdf/2603.21136
Copy Paste: [[2603.21136]] MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics(https://arxiv.org/abs/2603.21136)
Keywords: generation
Abstract: Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.

Title: Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

Authors: Wenjin Hou, Xiaoxiao Sun, Hehe Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21138
Pdf URL: https://arxiv.org/pdf/2603.21138
Copy Paste: [[2603.21138]] Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues(https://arxiv.org/abs/2603.21138)
Keywords: generation, generative
Abstract: Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.

Title: Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images

Authors: Jiatong Xia, Lingqiao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21166
Pdf URL: https://arxiv.org/pdf/2603.21166
Copy Paste: [[2603.21166]] Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images(https://arxiv.org/abs/2603.21166)
Keywords: generation, generative
Abstract: We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: this https URL

Title: GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

Authors: Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21176
Pdf URL: https://arxiv.org/pdf/2603.21176
Copy Paste: [[2603.21176]] GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing(https://arxiv.org/abs/2603.21176)
Keywords: generation
Abstract: While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.

Title: Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis

Authors: Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21213
Pdf URL: https://arxiv.org/pdf/2603.21213
Copy Paste: [[2603.21213]] Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis(https://arxiv.org/abs/2603.21213)
Keywords: generation
Abstract: Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.

Title: Does Mechanistic Interpretability Transfer Across Data Modalities? A Cross-Domain Causal Circuit Analysis of Variational Autoencoders

Authors: Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.21236
Pdf URL: https://arxiv.org/pdf/2603.21236
Copy Paste: [[2603.21236]] Does Mechanistic Interpretability Transfer Across Data Modalities? A Cross-Domain Causal Circuit Analysis of Variational Autoencoders(https://arxiv.org/abs/2603.21236)
Keywords: generation, generative
Abstract: Although mechanism-based interpretability has generated an abundance of insight for discriminative network analysis, generative models are less understood, particularly outside of image-related applications. We investigate how much of the causal circuitry found within image-related variational autoencoders (VAEs) will generalize to tabular data, as VAEs are increasingly used for imputation, anomaly detection, and synthetic data generation. In addition to extending a four-level causal intervention framework to four tabular benchmarks and one image benchmark across five different VAE architectures (with 75 individual training runs per architecture and three random seed values for each run), this paper introduces three new techniques: posterior-calibration of Causal Effect Strength (CES), path-specific activation patching, and Feature-Group Disentanglement (FGD). The results from our experiments demonstrate that: (i) Tabular VAEs have circuits with modularity that is approximately 50% lower than their image counterparts. (ii) $\beta$-VAE experiences nearly complete collapse in CES scores when applied to heterogeneous tabular features (0.043 CES score for tabular data compared to 0.133 CES score for images), which can be directly attributed to reconstruction quality degradation (r = -0.886 correlation coefficient between CES and MSE). (iii) CES successfully captures nine of eleven statistically significant architecture differences using Holm-Šidák corrections. (iv) Interventions with high specificity predict the highest downstream AUC values (r = 0.460, p < .001). This study challenges the common assumption that architectural guidance from image-related studies can be transferred to tabular datasets.
摘要：尽管基于机制的可解释性已经为判别性网络分析产生了丰富的见解，但生成模型却鲜为人知——尤其是在图像相关应用之外。我们研究了图像相关变分自动编码器 (VAE) 中发现的因果电路有多少将推广到表格数据，因为 VAE 越来越多地用于插补、异常检测和合成数据生成。除了将四级因果干预框架扩展到跨五种不同 VAE 架构的四个表格和一个图像基准（每个架构有 75 次单独训练运行，每次运行有 3 个随机种子值）之外，本文还介绍了三种新技术：因果效应强度 (CES) 的后校准、特定于路径的激活修补和特征组解缠 (FGD)。我们的实验结果表明：(i) 表格 VAE 的电路模块化程度比图像电路低约 50%。 (ii) 当应用于异构表格特征时，$\beta$-VAE 的 CES 分数几乎完全崩溃（表格数据的 CES 分数为 0.043，而图像的 CES 分数为 0.133），这可以直接归因于重建质量下降（CES 和 MSE 之间的相关系数为 r = -0.886）。 (iii) CES 使用 Holm-Šidák 校正成功捕获了 11 个统计上显着的架构差异中的 9 个。 (iv) 具有高特异性的干预措施可预测最高的下游 AUC 值（r = 0.460，p < .001）。这项研究挑战了图像相关研究的架构指导可以转移到表格数据集的普遍假设。
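
Note on the Holm-Šidák correction cited in the abstract above: the following is a minimal sketch of the standard step-down procedure, not the authors' analysis code. The function name `holm_sidak` and the example p-values are illustrative assumptions; nothing here reproduces the paper's eleven architecture comparisons.

```python
def holm_sidak(p_values, alpha=0.05):
    """Step-down Holm-Sidak correction.

    Returns (adjusted_p, reject), both in the original order of p_values.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak adjustment over the (m - rank) hypotheses still in play.
        adj = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        # Adjusted p-values must be non-decreasing along the sorted order.
        running_max = max(running_max, adj)
        adjusted[idx] = min(1.0, running_max)
    reject = [a <= alpha for a in adjusted]
    return adjusted, reject


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    adj, rej = holm_sidak([0.01, 0.04, 0.03])
    print(adj, rej)
```

Rejecting a hypothesis when its adjusted p-value is at most `alpha` is equivalent to the usual step-down rule of stopping at the first non-significant comparison.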

Title: Amortized Variational Inference for Logistic Regression with Missing Covariates

Authors: M. Cherifi, Aude Sportisse, Xujia Zhu, Mohammed Nabil El Korso, A. Mesloub
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2603.21244
Pdf URL: https://arxiv.org/pdf/2603.21244
Copy Paste: [[2603.21244]] Amortized Variational Inference for Logistic Regression with Missing Covariates(https://arxiv.org/abs/2603.21244)
Keywords: generative
Abstract: Missing covariate data pose a significant challenge to statistical inference and machine learning, particularly for classification tasks like logistic regression. Classical iterative approaches (EM, multiple imputation) are often computationally intensive, sensitive to high missingness rates, and limited in uncertainty propagation. Recent deep generative models based on VAEs show promise but rely on complex latent representations. We propose Amortized Variational Inference for Logistic Regression (AV-LR), a unified end-to-end framework for binary logistic regression with missing covariates. AV-LR integrates a probabilistic generative model with a simple amortized inference network, trained jointly by maximizing the evidence lower bound. Unlike competing methods, AV-LR performs inference directly in the space of missing data without additional latent variables, using a single inference network and a linear layer that jointly estimate regression parameters and the missingness mechanism. AV-LR achieves estimation accuracy comparable to or better than state-of-the-art EM-like algorithms, with significantly lower computational cost. It naturally extends to missing-not-at-random settings by explicitly modeling the missingness mechanism. Empirical results on synthetic and real-world datasets confirm its effectiveness and efficiency across various missing-data scenarios.
摘要：缺失的协变量数据对统计推断和机器学习构成了重大挑战，特别是对于逻辑回归等分类任务。经典的迭代方法（EM，多重插补）通常计算量大，对高缺失率敏感，并且不确定性传播有限。最近基于 VAE 的深度生成模型显示出前景，但依赖于复杂的潜在表示。我们提出了逻辑回归分摊变分推理（AV-LR），这是一种用于缺少协变量的二元逻辑回归的统一端到端框架。 AV-LR 将概率生成模型与简单的摊销推理网络集成在一起，通过最大化证据下限进行联合训练。与竞争方法不同，AV-LR 直接在缺失数据的空间中执行推理，无需额外的潜在变量，使用单个推理网络和线性层共同估计回归参数和缺失机制。 AV-LR 的估计精度与最先进的类 EM 算法相当或更好，并且计算成本显着降低。通过对缺失机制进行显式建模，它自然地扩展到非随机缺失设置。合成数据集和真实数据集的实证结果证实了其在各种缺失数据场景中的有效性和效率。

Title: Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation

Authors: Soudeep Ghoshal, Sandipan Chakraborty, Pradipto Chowdhury, Himanshu Buckchash
Subjects: cs.LG, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2603.21282
Pdf URL: https://arxiv.org/pdf/2603.21282
Copy Paste: [[2603.21282]] Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation(https://arxiv.org/abs/2603.21282)
Keywords: generation
Abstract: Machine learning techniques, such as Transformers and Long Short-Term Memory (LSTM) networks, play a crucial role in Symbolic Music Generation (SMG). Existing literature indicates a difference between LSTMs and Transformers regarding their ability to model local melodic continuity versus maintaining global structural coherence. However, their specific properties within the context of SMG have not been systematically studied. This paper addresses this gap by providing a fine-grained comparative analysis of LSTMs versus Transformers for SMG, examining local and global properties in detail using 17 musical quality metrics on the Deutschl dataset. We find that LSTM networks excel at capturing local patterns but fail to preserve long-range dependencies, while Transformers model global structure effectively but tend to produce irregular phrasing. Based on this analysis and leveraging their respective strengths, we propose a Hybrid architecture combining a Transformer Encoder with an LSTM Decoder and evaluate it against both baselines. We evaluated 1,000 generated melodies from each of the three architectures on the Deutschl dataset. The results show that the hybrid method achieves better local and global continuity and coherence compared to the baselines. Our work highlights the key characteristics of these models and demonstrates how their properties can be leveraged to design superior models. We also supported the experiments with ablation studies and human perceptual evaluations, which statistically support the findings and provide robust validation for this work.
摘要：机器学习技术，例如 Transformer 和长短期记忆 (LSTM) 网络，在符号音乐生成 (SMG) 中发挥着至关重要的作用。现有文献表明 LSTM 和 Transformer 之间在建模局部旋律连续性与保持全局结构连贯性的能力方面存在差异。然而，它们在 SMG 背景下的具体特性尚未得到系统研究。本文通过对 SMG 的 LSTM 与 Transformer 进行细粒度比较分析，使用 Deutschl 数据集上的 17 个音乐质量指标详细检查本地和全局属性，从而解决了这一差距。我们发现 LSTM 网络擅长捕获局部模式，但无法保留远程依赖性，而 Transformers 可以有效地模拟全局结构，但往往会产生不规则的措辞。基于此分析并利用各自的优势，我们提出了一种将 Transformer 编码器与 LSTM 解码器相结合的混合架构，并根据两个基线对其进行评估。我们在 Deutschl 数据集上评估了三种架构中每种架构生成的 1,000 首旋律。结果表明，与基线相比，混合方法实现了更好的局部和全局连续性和一致性。我们的工作突出了这些模型的关键特征，并展示了如何利用它们的特性来设计卓越的模型。我们还支持消融研究和人类感知评估的实验，这些实验在统计上支持了研究结果，并为这项工作提供了强有力的验证。

Title: Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting

Authors: Yuntian Bo, Yazhou Zhu, Piotr Koniusz, Haofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21287
Pdf URL: https://arxiv.org/pdf/2603.21287
Copy Paste: [[2603.21287]] Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting(https://arxiv.org/abs/2603.21287)
Keywords: generation
Abstract: Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM's over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at this https URL.
摘要：传统的少镜头医学图像分割（FSMIS）方法面临性能瓶颈，阻碍了更广泛的临床适用性。尽管Segment Anything Model（SAM）展现出强大的类别无关分割能力，但其直接应用于医学图像往往会因解剖边界不明确而导致过度分割。在本文中，我们将基于 SAM 的 FSMIS 重新表述为提示定位任务，并提出了 FoB（关注背景），这是一种以背景为中心的提示生成器，可提供准确的背景提示以限制 SAM 的过度分割。具体来说，FoB 通过与类别无关的支持背景提示生成并将其直接本地化在查询图像中，弥合了分割和提示本地化之间的差距。为了解决新类别快速定位的挑战，FoB 对丰富的上下文信息进行建模以捕获前景-背景空间依赖性。此外，受医学图像中背景提示的固有结构模式的启发，FoB 将此结构建模为逐步完善背景提示预测的约束。在三个不同的医学图像数据集上进行的实验表明，FoB 大幅优于其他基线，在 FSMIS 上实现了最先进的性能，并表现出强大的跨域泛化能力。我们的代码可以在这个 https URL 上找到。

Title: Text-Image Conditioned 3D Generation

Authors: Jiazhong Cen, Jiemin Fang, Sikuang Li, Guanjun Wu, Chen Yang, Taoran Yi, Zanwei Zhou, Zhikuan Bao, Lingxi Xie, Wei Shen, Qi Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21295
Pdf URL: https://arxiv.org/pdf/2603.21295
Copy Paste: [[2603.21295]] Text-Image Conditioned 3D Generation(https://arxiv.org/abs/2603.21295)
Keywords: generation, generative
Abstract: High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: this https URL
摘要：高质量的 3D 资产对于 VR/AR、工业设计和娱乐至关重要，这激发了人们对根据用户提示创建 3D 内容的生成模型越来越感兴趣。然而，大多数现有的 3D 生成器依赖于单一的调节模式：图像调节模型通过利用像素对齐线索实现高视觉保真度，但当输入视图有限或模糊时会出现视点偏差，而文本调节模型提供广泛的语义指导，但缺乏低级视觉细节。这限制了用户表达意图的方式，并提出了一个自然的问题：这两种模式能否结合起来以实现更灵活和忠实的 3D 生成？我们的诊断研究表明，即使是文本和图像条件预测的简单后期融合也优于单模态模型，揭示了强大的跨模态互补性。因此，我们将文本图像条件 3D 生成形式化，这需要对视觉示例和文本规范进行联合推理。为了解决这个任务，我们引入了 TIGON，一种极简的双分支基线，具有独立的图像和文本条件主干以及轻量级跨模态融合。大量实验表明，文本-图像调节比单模态方法持续改进，凸显互补视觉-语言指导是未来 3D 生成研究的一个有希望的方向。项目页面：此 https URL

Title: Identity-Consistent Video Generation under Large Facial-Angle Variations

Authors: Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21299
Pdf URL: https://arxiv.org/pdf/2603.21299
Copy Paste: [[2603.21299]] Identity-Consistent Video Generation under Large Facial-Angle Variations(https://arxiv.org/abs/2603.21299)
Keywords: generation
Abstract: Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
摘要：单视图参考视频方法通常难以在较大的面部角度变化下保持身份一致性。这种限制自然会促使多视图面部参考的结合。然而，简单地引入额外的参考图像会加剧 \textit{copy-paste} 问题，特别是 \textbf{\textit{view-dependent copy-paste}} 伪像，这会降低面部运动的自然度。尽管交叉配对数据可以缓解这个问题，但收集此类数据的成本很高。为了平衡一致性和自然性，我们提出了 $\mathrm{Mv}^2\mathrm{ID}$，一个成对监督下的多视图条件框架。我们引入了区域屏蔽训练策略，以防止捷径学习，并通过鼓励模型聚合跨视图的互补身份线索来提取基本的身份特征。此外，我们设计了一种参考解耦 RoPE 机制，该机制为视频和调节令牌分配不同的位置编码，以便更好地对其异构属性进行建模。此外，我们构建了一个具有不同面部角度变化的大规模数据集，并提出了用于身份一致性和运动自然度的专用评估指标。大量的实验表明，我们的方法在保持运动自然性的同时显着提高了身份一致性，优于使用交叉配对数据训练的现有方法。

Title: KHMP: Frequency-Domain Kalman Refinement for High-Fidelity Human Motion Prediction

Authors: Wenhan Wu, Zhishuai Guo, Chen Chen, Srijan Das, Hongfei Xue, Pu Wang, Aidong Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21327
Pdf URL: https://arxiv.org/pdf/2603.21327
Copy Paste: [[2603.21327]] KHMP: Frequency-Domain Kalman Refinement for High-Fidelity Human Motion Prediction(https://arxiv.org/abs/2603.21327)
Keywords: generative
Abstract: Stochastic human motion prediction aims to generate diverse, plausible futures from observed sequences. Despite advances in generative modeling, existing methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities. To address these challenges, we introduce KHMP, a novel framework featuring an adaptive Kalman filter applied in the DCT domain to generate high-fidelity human motion predictions. By treating high-frequency DCT coefficients as a frequency-indexed noisy signal, the Kalman filter recursively suppresses noise while preserving motion details. Notably, its noise parameters are dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This refinement is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles into the generative model. Together, these innovations establish a new paradigm integrating adaptive signal processing with physics-informed learning. Experiments on the Human3.6M and HumanEva-I datasets demonstrate that KHMP achieves state-of-the-art accuracy, effectively mitigating jitter artifacts to produce smooth and physically plausible motions.
摘要：随机人体运动预测旨在从观察到的序列中生成多样化的、合理的未来。尽管生成建模取得了进步，但现有方法经常会产生因高频抖动和时间不连续性而损坏的预测。为了解决这些挑战，我们引入了 KHMP，这是一种新颖的框架，其特点是在 DCT 域中应用自适应卡尔曼滤波器来生成高保真人体运动预测。通过将高频 DCT 系数视为频率索引噪声信号，卡尔曼滤波器递归地抑制噪声，同时保留运动细节。值得注意的是，它的噪声参数是根据估计的信噪比 (SNR) 动态调整的，从而能够对抖动预测进行积极的去噪，并对干净的运动进行保守的过滤。这种改进得到了训练时物理约束（时间平滑度和关节角度限制）的补充，这些约束将生物力学原理编码到生成模型中。这些创新共同建立了一种将自适应信号处理与物理知识学习相结合的新范式。 Human3.6M 和 HumanEva-I 数据集上的实验表明，KHMP 实现了最先进的精度，有效减轻了抖动伪影，以产生平滑且物理上合理的运动。
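
Note on the frequency-domain Kalman refinement described in the abstract above: the sketch below illustrates the general idea on a single 1-D joint trajectory and is not the KHMP implementation. It assumes a scalar random-walk state model along the frequency index, an SNR estimate taken as the ratio of low-band to high-band DCT energy, and a measurement-noise rule `r = base_r / snr`. The names `refine`, `kalman_over_frequency`, `start`, `q`, and `base_r` are all hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)))
    return out


def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        out.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
            * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n)))
    return out


def kalman_over_frequency(coeffs, q, r):
    """Scalar random-walk Kalman filter run along the frequency index.

    q is the process-noise variance, r the measurement-noise variance.
    The first coefficient initializes the state and passes through unchanged.
    """
    if not coeffs:
        return []
    est, p = coeffs[0], r
    out = [est]
    for z in coeffs[1:]:
        p += q                    # predict: uncertainty grows
        gain = p / (p + r)        # Kalman gain, between 0 and 1
        est += gain * (z - est)   # update toward the observed coefficient
        p *= (1.0 - gain)
        out.append(est)
    return out


def refine(trajectory, start, q=0.01, base_r=0.01):
    """Filter DCT coefficients with index >= start; keep the low band intact."""
    c = dct(trajectory)
    low = sum(v * v for v in c[:start])
    high = sum(v * v for v in c[start:])
    snr = low / (high + 1e-12)      # assumed SNR proxy
    r = base_r / (snr + 1e-12)      # low SNR -> large r -> stronger smoothing
    c[start:] = kalman_over_frequency(c[start:], q, r)
    return idct(c)
```

With this rule a jittery trajectory (low SNR) gets a large `r` and therefore a small gain, which damps the high band strongly, while a clean trajectory gets a gain close to 1 and passes almost unchanged.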

Title: EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization

Authors: Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21332
Pdf URL: https://arxiv.org/pdf/2603.21332
Copy Paste: [[2603.21332]] EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization(https://arxiv.org/abs/2603.21332)
Keywords: generation
Abstract: Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
摘要：音频驱动的 3D 头部说话合成技术凭借神经辐射场 (NeRF) 和 3D 高斯散射 (3DGS) 得到了快速发展。通过利用丰富的预先训练的先验知识，少量镜头方法可以从短短几秒钟的视频中实现即时个性化。然而，在富有表现力的面部运动下，现有的少样本方法经常遭受几何不稳定和音频情感不匹配的影响，这凸显了对更有效的情感感知运动建模的需求。在这项工作中，我们提出了 EmoTaG，这是一个基于预训练和适应范式构建的几次镜头情感感知 3D 头部说话合成框架。我们的主要见解是在结构化 FLAME 参数空间中重新制定运动预测，而不是直接使 3D 高斯变形，从而引入显式几何先验来提高运动稳定性。在此基础上，我们提出了门控残差运动网络（GRMN），它可以从音频中捕获情感韵律，同时补充音频中缺少的头部姿势和上脸线索，从而实现富有表现力和连贯的运动生成。大量实验表明，EmoTaG 在情感表达、唇形同步、视觉真实感和运动稳定性方面实现了最先进的性能。

Title: Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution

Authors: Yu-Shan Tai, An-Yeu (Andy) Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21348
Pdf URL: https://arxiv.org/pdf/2603.21348
Copy Paste: [[2603.21348]] Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution(https://arxiv.org/abs/2603.21348)
Keywords: generation
Abstract: Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.
摘要：最近，扩散模型（DM）在高质量图像生成方面取得了重大进展。然而，多步骤去噪过程通常会导致相当大的计算开销，阻碍了在资源受限的边缘设备上的部署。现有方法通过压缩模型和调整时间步序列来缓解这个问题。然而，它们忽略了输入冗余并且需要很长的搜索时间。在本文中，我们提出了具有时间步序列重新分布的从粗到细的扩散模型。识别出无法区分的早期生成图像，我们引入了粗到细去噪（C2F）来减少粗略特征生成期间的计算。此外，我们设计了时间步序列重分配（TRD）来实现高效的采样轨迹调整，搜索时间不到10分钟。实验结果表明，所提出的方法在 CIFAR10 和 LSUN-Church 上实现了近乎无损的性能，计算量减少了 80% 到 90%。

Title: Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation

Authors: Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, Ioannis Patras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21366
Pdf URL: https://arxiv.org/pdf/2603.21366
Copy Paste: [[2603.21366]] Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation(https://arxiv.org/abs/2603.21366)
Keywords: generation
Abstract: Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
摘要：自回归（AR）视频扩散最近已成为长视频生成的一种有前途的范例，使因果合成超越了双向模型的限制。为了解决训练与推理不匹配的问题，人们提出了一系列自我强制策略，通过在训练期间根据模型自身的预测调节模型来提高推出稳定性。虽然这些方法大大减轻了暴露偏差，但由于时间逐渐退化，将发电范围扩展到分钟尺度仍然具有挑战性。在这项工作中，我们表明这种限制主要不是由内存不足引起的，而是由推理过程中临时内存的利用方式引起的。通过实证分析，我们发现增加记忆并不能持续改善长视域生成，并且历史背景的时间位置显着影响运动动态，同时视觉质量基本保持不变。这些发现表明，时间记忆不应被视为同质缓冲区。受这一见解的启发，我们引入了 Relax Forcing，一种用于 AR 扩散的结构化时间记忆机制。 Relax Forcing 没有关注密集生成的历史，而是将时间上下文分解为三个功能角色：用于全局稳定性的 Sink，用于短期连续性的 Tail，以及用于结构运动指导的动态选择的 History，并有选择地仅合并最相关的过去信息。这种设计减少了外推过程中的误差累积，同时保留了运动演化。 VBench-Long 上的实验表明，Relax Forcing 可以改善运动动态和整体时间一致性，同时减少注意力开销。我们的结果表明，结构化时间记忆对于可扩展的长视频生成至关重要，补充了现有的基于强制的训练策略。
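
Note on the Sink / Tail / History decomposition described in the abstract above: the following sketch shows one way such a structured memory could pick which past frames to attend to. It is an illustration under stated assumptions, not the Relax Forcing method. In particular, the abstract does not say how History frames are selected, so the per-frame relevance `scores` and the parameters `n_sink`, `n_tail`, and `n_history` are hypothetical.

```python
def select_context(scores, n_sink=1, n_tail=2, n_history=2):
    """Return sorted indices of past frames to keep in the attention context.

    scores[i] is an assumed relevance score for past frame i (higher = more
    relevant). Sink = earliest frames, Tail = most recent frames, History =
    top-scoring frames in between.
    """
    n = len(scores)
    sink = list(range(min(n_sink, n)))
    # The tail never overlaps the sink, even for very short histories.
    tail_start = max(len(sink), n - n_tail)
    tail = list(range(tail_start, n))
    middle = range(len(sink), tail_start)
    # Keep only the most relevant intermediate frames.
    history = sorted(middle, key=lambda i: scores[i], reverse=True)[:n_history]
    return sink + sorted(history) + tail


if __name__ == "__main__":
    # Hypothetical relevance scores for eight already-generated frames.
    print(select_context([0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.5, 0.6]))
```

The context size stays bounded by `n_sink + n_history + n_tail` regardless of how long the rollout becomes, which is the property that would reduce attention overhead relative to attending to the dense history.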

Title: DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

Authors: James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.21461
Pdf URL: https://arxiv.org/pdf/2603.21461
Copy Paste: [[2603.21461]] DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment(https://arxiv.org/abs/2603.21461)
Keywords: generation
Abstract: Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
摘要：偏好对齐通常是通过偏好数据的权重更新训练来实现的，这增加了大量的对齐阶段计算并提供有限的机械可见性。我们提出了用于偏好对齐的动态 SAE 转向 (DSPA)，这是一种推理时间方法，使稀疏自动编码器 (SAE) 转向具有提示条件。根据偏好三元组，DSPA 计算将提示特征与生成控制特征联系起来的条件差异图；在解码期间，它仅修改令牌活动潜在变量，而不更新基本模型权重。在 Gemma-2-2B/9B 和 Qwen3-8B 中，DSPA 改进了 MT-Bench，在 AlpacaEval 上具有竞争力，同时保持了多项选择的准确性。在受限偏好数据下，DSPA 仍然稳健，可以与两级 RAHF-SCIT 管道相媲美，同时需要的对齐阶段 FLOP 减少高达 $4.47\times$。最后，我们审核了 DSPA 修改的 SAE 特征，发现偏好方向由话语和文体信号主导，并提供理论阐明条件差异图估计以及何时原则上进行 top-$k$ 消融。

Title: Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models

Authors: Hyundong Jin, Dongyoon Han, Eunwoo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21484
Pdf URL: https://arxiv.org/pdf/2603.21484
Copy Paste: [[2603.21484]] Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models(https://arxiv.org/abs/2603.21484)
Keywords: generation
Abstract: Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.
摘要：持续的忘却带来了挑战，使大型视觉语言模型能够选择性地拒绝特定的图像指令对以响应顺序删除请求，同时保留通用性。然而，连续的遗忘更新会扭曲共享表征，在视觉-语言对和拒绝行为之间产生虚假关联，从而阻碍拒绝目标的精确识别，从而导致不适当的拒绝。为了应对这一挑战，我们提出了一种新颖的持续忘却框架，该框架将拒绝行为基于从删除目标分解的视觉和文本概念的细粒度描述。我们首先通过概念调节器确定哪些视觉语言概念组合表征了每个遗忘类别，然后确定如何通过拒绝专家（称为拒绝者）的混合来生成适当的拒绝响应，每个拒绝专家专门用于概念对齐的拒绝生成。为了跨顺序任务生成特定于概念的拒绝响应，我们引入了一种多模式、概念驱动的路由方案，该方案将拒绝者重用于共享相似概念的任务，并针对新概念调整未充分利用的拒绝者。对视觉语言基准的大量实验表明，所提出的框架通过生成基于概念的拒绝响应并保持跨学习序列的通用性，优于现有方法。

Title: Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation

Authors: Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21488
Pdf URL: https://arxiv.org/pdf/2603.21488
Copy Paste: [[2603.21488]] Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation(https://arxiv.org/abs/2603.21488)
Keywords: generation
Abstract: The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at this https URL.
摘要：多模态大语言模型（MLLM）的繁荣刺激了对视频推理分割的需求，其目的是基于人类指令来分割视频对象。以前的研究依赖于单向和隐式文本轨迹对齐，当面对严重的视频动态时，它会与轨迹感知作斗争。在这项工作中，我们提出了 TrajSeg，一个基于 MLLM 构建的简单且统一的框架。具体来说，我们引入了双向文本轨迹对齐，其中 MLLM 接受基础意图（文本到轨迹）和字幕意图（轨迹到文本）指令。通过这种方式，MLLM 可以受益于增强的对应性并更好地感知视频中的对象轨迹。从轨迹生成掩模是通过帧级内容集成（FCI）模块和统一掩模解码器实现的。前者使 MLLM 解析的轨迹级标记适应特定于帧的信息。后者将所有帧的分割统一为单个结构，从而使所提出的框架得以简化并可进行端到端训练。对参考和推理视频分割数据集的大量实验证明了 TrajSeg 的有效性，它在所有指标上都优于所有视频推理分割方法。该代码将在此 https URL 上公开提供。

Title: VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

Authors: Xinghan Li, Junhao Xu, Jingjing Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21526
Pdf URL: https://arxiv.org/pdf/2603.21526
Copy Paste: [[2603.21526]] VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection(https://arxiv.org/abs/2603.21526)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence--conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
摘要：多模态大语言模型（MLLM）通过生成文本解释为可解释的深度伪造检测提供了一条有希望的途径。然而，当前基于 MLLM 的方法的推理过程将证据生成和操作定位结合成一个统一的步骤。这种结合模糊了忠实观察和幻觉解释之间的界限，导致了不可靠的结论。在此基础上，我们提出了 VIGIL，一个以部件为中心的结构化取证框架，其灵感来自于专家取证实践，通过计划然后检查的流程：该模型首先根据全局视觉线索计划哪些面部部位值得检查，然后使用独立来源的取证证据检查每个部分。阶段门控注入机制仅在检查期间提供零件级取证证据，确保零件选择仍然由模型自身的感知驱动，而不是受到外部信号的影响。我们进一步提出了一种渐进的三阶段训练范例，其强化学习阶段采用部分感知奖励来加强解剖有效性和证据-结论的一致性。为了实现严格的普遍性评估，我们构建了 OmniFake，这是一个分层的 5 级基准，其中模型仅在三个基础生成器上进行训练，并逐步针对野外社交媒体数据进行测试。对 OmniFake 和跨数据集评估的大量实验表明，VIGIL 在所有通用性级别上始终优于专家检测器和基于 MLLM 的并发方法。

Title: From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy

Authors: Bi'an Du, Daizong Liu, Pufan Li, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21557
Pdf URL: https://arxiv.org/pdf/2603.21557
Copy Paste: [[2603.21557]] From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy(https://arxiv.org/abs/2603.21557)
Keywords: generation, generative
Abstract: Single-image 3D generation lies at the core of vision-to-graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part-aware models such as PartCrafter, which still require a labor-intensive user-specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single-image 3D generation as learning an adaptive part-whole hierarchy in the flexible 3D latent space. We present a novel part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines the slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling powerful cross-category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape-prior sharing as well as slot-gating for structural adaptation.

Title: Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning

Authors: Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim, Sungmin Woo, Inseok Jeon, Sangyoun Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21559
Pdf URL: https://arxiv.org/pdf/2603.21559
Copy Paste: [[2603.21559]] Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning(https://arxiv.org/abs/2603.21559)
Keywords: generation
Abstract: Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noise. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.

Title: Riemannian Geometry Speaks Louder Than Words: From Graph Foundation Model to Next-Generation Graph Intelligence

Authors: Philip S. Yu, Li Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21601
Pdf URL: https://arxiv.org/pdf/2603.21601
Copy Paste: [[2603.21601]] Riemannian Geometry Speaks Louder Than Words: From Graph Foundation Model to Next-Generation Graph Intelligence(https://arxiv.org/abs/2603.21601)
Keywords: generation
Abstract: Graphs provide a natural description of the complex relationships among objects, and play a pivotal role in communications, transportation, social computing, the life sciences, etc. Currently, there is strong agreement that Graph Foundation Models (GFMs) are essential for advancing graph learning, yet considerable disagreement persists on how to build a powerful, general-purpose GFM analogous to Large Language Models (LLMs). Graph Neural Networks (GNNs) exhibit limitations in memory retention and principled interpretability when confronted with multi-domain pretraining and adaptation. The challenge of graph serialization hinders the direct application of LLMs, as the words struggle to capture the structural complexity and diversity inherent in graphs. In contrast, Riemannian geometry offers an elegant mathematical framework for modeling structures, while remaining compatible with graph semantic learning, even with LLMs. In this paper, we argue that, for graphs, Riemannian geometry speaks louder than words, and lay out the foundational principles for GFM. Reimagining with Riemannian geometry, we introduce a blue-sky idea, the Riemannian Foundation Model (RFM), that opens a new pathway for capturing complex structural patterns and uncovering cross-domain generalities. RFM emphasizes intrinsic graph geometry and embodies endogenous capacities for structural inference and generation, moving beyond mere representation-space switching. Accordingly, we outline a progressive agenda that begins with universal structural understanding through intrinsic geometry, and then rebuilds LLM with a Riemannian engine for general-purpose graph modeling and beyond. Thus, RFM enables a paradigm shift from designing graph models to solving graph-structured applications with RFM agents, unlocking the next-generation graph intelligence.

Title: SARe: Structure-Aware Large-Scale 3D Fragment Reassembly

Authors: Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo, Shuainan Ye, Tan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21611
Pdf URL: https://arxiv.org/pdf/2603.21611
Copy Paste: [[2603.21611]] SARe: Structure-Aware Large-Scale 3D Fragment Reassembly(https://arxiv.org/abs/2603.21611)
Keywords: generation, generative
Abstract: 3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.

Title: AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing

Authors: Guandong Li, Zhaobin Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21615
Pdf URL: https://arxiv.org/pdf/2603.21615
Copy Paste: [[2603.21615]] AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing(https://arxiv.org/abs/2603.21615)
Keywords: generation
Abstract: Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at this https URL
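Illustrative sketch: the Progressive Injection Schedule described in this abstract can be pictured as a weight that decays smoothly from 1 (keep source features) to 0 (free target generation). The abstract names only the three decay families (sigmoid, cosine, linear); the function names `injection_weight` and `blend`, the normalized step `t` in [0, 1], the midpoint `t0`, and the steepness `k` below are assumptions made for illustration, not the authors' parameterization.

```python
import math

def injection_weight(t, kind="cosine", t0=0.5, k=10.0):
    """Source-feature injection weight in [0, 1] at normalized step t in [0, 1].

    Hypothetical parameterization: the weight starts near 1 (preserve the
    source) and decays toward 0 (let the model synthesize edited content).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if kind == "linear":
        return 1.0 - t
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(k * (t - t0)))
    if kind == "binary":  # the hard on/off cutoff that the smooth schedules replace
        return 1.0 if t < t0 else 0.0
    raise ValueError(f"unknown schedule: {kind}")

def blend(source_feat, target_feat, w):
    """Mix source and target features element-wise with injection weight w."""
    return [w * s + (1.0 - w) * g for s, g in zip(source_feat, target_feat)]
```

Unlike the binary schedule, the three continuous variants change the blend gradually between consecutive steps, which is the property the abstract credits with removing feature-discontinuity artifacts.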

Title: Efficient Zero-Shot AI-Generated Image Detection

Authors: Ryosuke Sonoda, Ramya Srinivasan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21619
Pdf URL: https://arxiv.org/pdf/2603.21619
Copy Paste: [[2603.21619]] Efficient Zero-Shot AI-Generated Image Detection(https://arxiv.org/abs/2603.21619)
Keywords: generation
Abstract: The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free approaches. Experiments on challenging benchmarks demonstrate the efficacy of our method over the state of the art (SoTA). In particular, on the OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.

Title: Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

Authors: Yuehu Gong, Zeyuan Wang, Yulin Chen, Yanwei Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.21621
Pdf URL: https://arxiv.org/pdf/2603.21621
Copy Paste: [[2603.21621]] Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective(https://arxiv.org/abs/2603.21621)
Keywords: generation, generative
Abstract: On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
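Illustrative sketch: the clipping-based and penalty-based objectives contrasted in this abstract build on the two classic action-space PPO surrogates, shown below per sample. This is the standard action-space form only; how GSB-PPO lifts the ratio and the proximal term to full generation trajectories is not specified in the abstract. The function names and the defaults `eps` and `beta` are assumptions for illustration.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample clipped surrogate (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound on the gain.
    return min(ratio * advantage, clipped * advantage)

def ppo_penalty_term(logp_new, logp_old, advantage, beta=1.0):
    """Per-sample KL-penalized surrogate (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    # Single-sample estimate of KL(old || new) for an action drawn from the old policy.
    kl_estimate = logp_old - logp_new
    return ratio * advantage - beta * kl_estimate
```

The clipped form removes the incentive to move the ratio outside [1 - eps, 1 + eps], while the penalty form keeps a smooth gradient everywhere and trades improvement against divergence through `beta`.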

Title: TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints

Authors: Vagish Kumar, Syed Bahauddin Alam, Souvik Chakraborty
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2603.21656
Pdf URL: https://arxiv.org/pdf/2603.21656
Copy Paste: [[2603.21656]] TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints(https://arxiv.org/abs/2603.21656)
Keywords: generation
Abstract: Protecting patient privacy remains a fundamental barrier to scaling machine learning across healthcare institutions, where centralizing sensitive data is often infeasible due to ethical, legal, and regulatory constraints. Federated learning offers a promising alternative by enabling privacy-preserving, multi-institutional training without sharing raw patient data; however, real-world deployments face severe challenges from data heterogeneity, site-specific biases, and class imbalance, which degrade predictive reliability and render existing uncertainty quantification methods ineffective. Here, we present TrustFed, a federated uncertainty quantification framework that provides distribution-free, finite-sample coverage guarantees under heterogeneous and imbalanced healthcare data, without requiring centralized access. TrustFed introduces a representation-aware client assignment mechanism that leverages internal model representations to enable effective calibration across institutions, along with a soft-nearest threshold aggregation strategy that mitigates assignment uncertainty while producing compact and reliable prediction sets. Using over 430,000 medical images across six clinically distinct imaging modalities, we conduct one of the most comprehensive evaluations of uncertainty-aware federated learning in medical imaging, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes. By validating TrustFed at this scale and breadth, our study advances uncertainty-aware federated learning from proof-of-concept toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.
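Illustrative sketch: the "distribution-free, finite-sample coverage guarantees" and "prediction sets" mentioned in this abstract are the vocabulary of conformal prediction. The code below shows plain, centralized split conformal classification as background; it is not TrustFed's federated procedure, and the function names and the nonconformity score (one minus the predicted probability) are assumptions for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    cal_scores holds one nonconformity score per held-out calibration example,
    here assumed to be 1 - p(true class).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: every class is included
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, threshold):
    """Indices of classes whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

Under exchangeability of calibration and test data, sets built this way contain the true class with probability at least 1 - alpha. Heterogeneous, imbalanced clients break that assumption, which is the gap the abstract says TrustFed targets.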

Title: OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

Authors: Meilin Liu, Jiaying Wang, Jing Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21660
Pdf URL: https://arxiv.org/pdf/2603.21660
Copy Paste: [[2603.21660]] OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging(https://arxiv.org/abs/2603.21660)
Keywords: super-resolution
Abstract: Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
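Illustrative sketch: the frequency-domain insight in this abstract (shared structure lives in low-frequency components) can be shown on a toy 1D signal. Two signals that share a slow component but differ in their high-frequency content become identical after a low-pass filter. The naive DFT, the function names, and the `cutoff` parameter are assumptions for illustration; the abstract does not describe OmniFM's actual spectral operators, and real images would use a 2D FFT.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def low_pass(x, cutoff):
    """Keep only frequency bins with |frequency| <= cutoff, then invert."""
    n = len(x)
    spectrum = dft(x)
    # Bin k and bin n - k carry the same absolute frequency.
    kept = [spectrum[k] if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    return [sum(kept[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]
```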

Title: Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis

Authors: Kangbo Zhao, Miaoxin Guan, Xiang Chen, Yukai Shi, Jinshan Pan
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2603.21661
Pdf URL: https://arxiv.org/pdf/2603.21661
Copy Paste: [[2603.21661]] Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis(https://arxiv.org/abs/2603.21661)
Keywords: generation
Abstract: Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.

Title: Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Authors: Hung-Hsuan Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.21676
Pdf URL: https://arxiv.org/pdf/2603.21676
Copy Paste: [[2603.21676]] Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization(https://arxiv.org/abs/2603.21676)
Keywords: generation
Abstract: Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space -- enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier -- a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
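Illustrative sketch: the core idea of this abstract, reusing one set of weights for a variable number of steps, can be shown with a toy recurrence. A single tanh layer stands in for the Transformer block, the residual form `h + gamma * block(h)` is one plausible reading of "identity-biased recurrence", and the small scale `gamma` plays the role of a LayerScale-style initialization. All names and the exact update rule are assumptions, not the paper's architecture.

```python
import math

def shared_block(h, weights):
    """One shared-weight transformation (a tanh layer stands in for a Transformer block)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]

def depth_recurrent(h, weights, steps, gamma=0.1):
    """Apply the same block `steps` times with a residual, identity-biased update.

    The parameter count is fixed by `weights`; only `steps` changes the depth.
    """
    for _ in range(steps):
        update = shared_block(h, weights)
        # With small gamma each step stays close to the identity map.
        h = [x + gamma * u for x, u in zip(h, update)]
    return h
```

Because `steps` is an argument and not part of the parameters, the same model can be run with more recurrence steps at inference time, which is the trade the abstract describes.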

Title: When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?

Authors: Bahar Dibaei Nia, Farzan Farnia
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.21716
Pdf URL: https://arxiv.org/pdf/2603.21716
Copy Paste: [[2603.21716]] When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?(https://arxiv.org/abs/2603.21716)
Keywords: generative
Abstract: Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple Mixture-Greedy strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.

Title: Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging

Authors: Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.21717
Pdf URL: https://arxiv.org/pdf/2603.21717
Copy Paste: [[2603.21717]] Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging(https://arxiv.org/abs/2603.21717)
Keywords: generation, generative
Abstract: Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires both reliability (generalization across labs, devices, and experimental conditions) and accountability (detecting out-of-distribution cases where predictions may be unreliable). Uncertainty quantification (UQ) based approaches serve as promising candidates for these tasks, yet UQ for distribution-to-distribution generative models remains underexplored. We present a unified UQ framework, Bayesian Stochastic Flow Matching (BSFM), that disentangles aleatoric and epistemic uncertainty. The Stochastic Flow Matching (SFM) component augments deterministic flows with a diffusion term to improve model generalization to unseen scenarios. For UQ, we develop a scalable Bayesian approach -- MCD-Antithetic -- that combines Monte Carlo Dropout with sample-efficient antithetic sampling to produce effective anomaly scores for out-of-distribution detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability.
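Illustrative sketch: combining Monte Carlo Dropout with antithetic sampling, as this abstract describes, can be pictured on a toy linear model. Each uniform draw `u` yields one dropout mask and its antithetic partner built from `1 - u`, so the paired passes are negatively correlated. How MCD-Antithetic pairs its samples in practice is not stated in the abstract; the function names, the input-level dropout, and the use of prediction variance as the uncertainty score are assumptions for illustration.

```python
import random

def dropout_masks(n_units, keep_prob, rng):
    """Return a dropout mask and its antithetic partner from shared uniforms."""
    u = [rng.random() for _ in range(n_units)]
    mask = [1.0 if v < keep_prob else 0.0 for v in u]
    anti = [1.0 if (1.0 - v) < keep_prob else 0.0 for v in u]
    return mask, anti

def predict(x, w, mask, keep_prob):
    """Toy linear model with inverted dropout applied to the inputs."""
    return sum(wi * xi * mi for wi, xi, mi in zip(w, x, mask)) / keep_prob

def mc_dropout_antithetic(x, w, n_pairs=50, keep_prob=0.5, seed=0):
    """Mean prediction and variance across antithetic pairs of stochastic passes."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_pairs):
        mask, anti = dropout_masks(len(x), keep_prob, rng)
        preds.append(predict(x, w, mask, keep_prob))
        preds.append(predict(x, w, anti, keep_prob))
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var
```

With `keep_prob=0.5` the two masks in a pair are complementary, so each pair averages to the deterministic prediction and the mean estimate has essentially no sampling noise, while the spread across passes remains available as an uncertainty signal.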

Title: CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

Authors: Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2603.21743
Pdf URL: https://arxiv.org/pdf/2603.21743
Copy Paste: [[2603.21743]] CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning(https://arxiv.org/abs/2603.21743)
Keywords: generation, generative
Abstract: Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.

Title: Show Me What You Don't Know: Efficient Sampling from Invariant Sets for Model Validation

Authors: Armand Rousselot, Joran Wendebourg, Ullrich Köthe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.21782
Pdf URL: https://arxiv.org/pdf/2603.21782
Copy Paste: [[2603.21782]] Show Me What You Don't Know: Efficient Sampling from Invariant Sets for Model Validation(https://arxiv.org/abs/2603.21782)
Keywords: generation, generative
Abstract: The performance of machine learning models is determined by the quality of their learned features. They should be invariant under irrelevant data variation but sensitive to task-relevant details. To visualize whether this is the case, we propose a method to analyze feature extractors by sampling from their fibers -- equivalence classes defined by their invariances -- given an arbitrary representative. Unlike existing work where a dedicated generative model is trained for each feature detector, our algorithm is training-free and exploits a pretrained diffusion or flow-matching model as a prior. The fiber loss -- which penalizes mismatch in features -- guides the denoising process toward the desired equivalence class, via non-linear diffusion trajectory matching. This replaces days of training for invariance learning with a single guided generation procedure at comparable fidelity. Experiments on popular datasets (ImageNet, CheXpert) and model types (ResNet, DINO, BiomedClip) demonstrate that our framework can reveal invariances ranging from very desirable to concerning behaviour. For instance, we show how Qwen-2B places patients with situs inversus (heart on the right side) in the same fiber as typical anatomy.

Title: SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

Authors: Bingxuan Zhao, Qing Zhou, Chuang Yang, Qi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21783
Pdf URL: https://arxiv.org/pdf/2603.21783
Copy Paste: [[2603.21783]] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis(https://arxiv.org/abs/2603.21783)
Keywords: generation, generative
Abstract: Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at this https URL.

Title: Dynamic Exposure Burst Image Restoration

Authors: Woohyeok Kim, Jaesung Rim, Daeyeon Kim, Sunghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21784
Pdf URL: https://arxiv.org/pdf/2603.21784
Copy Paste: [[2603.21784]] Dynamic Exposure Burst Image Restoration(https://arxiv.org/abs/2603.21784)
Keywords: restoration
Abstract: Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.

Title: The Universal Normal Embedding

Authors: Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2603.21786
Pdf URL: https://arxiv.org/pdf/2603.21786
Copy Paste: [[2603.21786]] The Universal Normal Embedding(https://arxiv.org/abs/2603.21786)
Keywords: generation, generative
Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at this https URL.
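The orthogonalization step mentioned in this abstract can be illustrated with a single Gram-Schmidt projection: before editing along one attribute direction, remove its component along another attribute's direction. This is a minimal sketch, assuming edits are plain linear shifts in latent space; the three-dimensional vectors and the names `smile`, `gender`, and `edit` are hypothetical stand-ins for learned probe directions, not the paper's code.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(direction, nuisance):
    """Remove from `direction` its component along `nuisance` (one Gram-Schmidt step)."""
    scale = dot(direction, nuisance) / dot(nuisance, nuisance)
    return [d - scale * n for d, n in zip(direction, nuisance)]


def edit(latent, direction, strength):
    """Shift a latent linearly along an attribute direction."""
    return [z + strength * d for z, d in zip(latent, direction)]


smile = [1.0, 1.0, 0.0]   # hypothetical probe direction, entangled with gender
gender = [0.0, 1.0, 0.0]  # hypothetical probe direction for a second attribute
smile_clean = orthogonalize(smile, gender)

latent = [0.2, -0.4, 0.9]
edited = edit(latent, smile_clean, strength=2.0)
```

After the projection, moving along `smile_clean` leaves the latent's score along `gender` unchanged, which is the sense in which the edit is disentangled in this toy setting.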

Title: Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning

Authors: Sulian Thual, Feiyang Cai, Jingjing Wang, Feng Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21856
Pdf URL: https://arxiv.org/pdf/2603.21856
Copy Paste: [[2603.21856]] Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning(https://arxiv.org/abs/2603.21856)
Keywords: generative
Abstract: Generative deep learning is a powerful tool for modeling the Madden-Julian oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features including composites, power spectra and multiscale structures including convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO, an isolated modulation by seasons and/or the El Niño-Southern Oscillation, and so on. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.

Title: Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

Authors: Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, Peng Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21864
Pdf URL: https://arxiv.org/pdf/2603.21864
Copy Paste: [[2603.21864]] Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation(https://arxiv.org/abs/2603.21864)
Keywords: generation, generative
Abstract: Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.

Title: Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Authors: Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21872
Pdf URL: https://arxiv.org/pdf/2603.21872
Copy Paste: [[2603.21872]] Manifold-Aware Exploration for Reinforcement Learning in Video Generation(https://arxiv.org/abs/2603.21872)
Keywords: generation
Abstract: Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at this https URL.
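For context on why noisy rollouts hurt, recall the group-relative advantage commonly used in GRPO-style training: each rollout's reward is standardized against the other rollouts sampled for the same prompt. The sketch below shows only that standard normalization, not the manifold-aware SDE or the trust-region components of SAGE-GRPO; the function name and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rollouts score the same.
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
flat = group_relative_advantages([2.0, 2.0, 2.0])
```

Because the advantages depend only on within-group differences, any noise in the reward estimates feeds directly into the update signal, which is consistent with the abstract's emphasis on keeping rollout quality high.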

Title: Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline

Authors: Elías Masquil, Thibaud Ehret, Pablo Musé, Gabriele Facciolo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21882
Pdf URL: https://arxiv.org/pdf/2603.21882
Copy Paste: [[2603.21882]] Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline(https://arxiv.org/abs/2603.21882)
Keywords: generation
Abstract: Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.

Title: Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation

Authors: Donald Shenaj, Federico Errica, Antonio Carta
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21884
Pdf URL: https://arxiv.org/pdf/2603.21884
Copy Paste: [[2603.21884]] Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation(https://arxiv.org/abs/2603.21884)
Keywords: generation
Abstract: Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community's consensus, regardless of the personalized subject's complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank's positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA$^2$, achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: this https URL.

Title: CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Authors: Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21901
Pdf URL: https://arxiv.org/pdf/2603.21901
Copy Paste: [[2603.21901]] CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal(https://arxiv.org/abs/2603.21901)
Keywords: generation, generative
Abstract: Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
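To put the reported PSNR gain in perspective, the standard definition PSNR = 10 log10(peak^2 / MSE) implies that a gain of g dB corresponds to dividing the mean squared error by 10^(g/10). The short sketch below works through that arithmetic; it assumes both PSNR values use the same peak value and refer to a single MSE rather than an average of per-frame PSNRs, so the resulting ratio is indicative only.

```python
import math


def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(6.77)          # roughly 4.75
baseline = psnr(100.0)                    # hypothetical baseline MSE of 100
improved = psnr(100.0 / ratio)            # same signal with MSE reduced by `ratio`
```

Under these assumptions, a +6.77 dB improvement means the squared reconstruction error is between four and five times smaller than the baseline's.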

Title: A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing

Authors: Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2603.21911
Pdf URL: https://arxiv.org/pdf/2603.21911
Copy Paste: [[2603.21911]] A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing(https://arxiv.org/abs/2603.21911)
Keywords: generation, generative
Abstract: Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.

Title: MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation

Authors: Wenqing Tian, Hanyi Mao, Zhaocheng Liu, Lihua Zhang, Qiang Liu, Jian Wu, Liang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21937
Pdf URL: https://arxiv.org/pdf/2603.21937
Copy Paste: [[2603.21937]] MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation(https://arxiv.org/abs/2603.21937)
Keywords: generation
Abstract: Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
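The subtraction step in the evaluation protocol can be sketched as follows: compute a slot-to-slot similarity matrix between generated and ground-truth subjects, then subtract the similarity the ground-truth subjects already have with each other. This is a simplified sketch, assuming subjects are already matched to slots and represented by embedding vectors from some specialist model; the cosine similarity, the two-dimensional embeddings, and the function names are illustrative assumptions, not the benchmark's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(rows, cols):
    """Pairwise similarity: entry [i][j] compares rows[i] with cols[j]."""
    return [[cosine(r, c) for c in cols] for r in rows]


def interference_matrix(generated, ground_truth):
    """Generated-vs-truth similarity minus the truth-vs-truth baseline."""
    s_gen = similarity_matrix(generated, ground_truth)
    s_gt = similarity_matrix(ground_truth, ground_truth)
    n = len(ground_truth)
    return [[s_gen[i][j] - s_gt[i][j] for j in range(n)] for i in range(n)]


truth = [[1.0, 0.0], [0.0, 1.0]]           # hypothetical embeddings for two subjects
faithful = interference_matrix(truth, truth)
swapped = interference_matrix([[0.0, 1.0], [1.0, 0.0]], truth)
```

In this reading, a negative diagonal entry signals that a subject has drifted from its own reference, while a positive off-diagonal entry signals attributes leaking in from another subject; the swapped example shows both at once.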

Title: GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design

Authors: Xiaolei Zhou, Chuangjie Fang, Jie Wu, Jingyi Yang, Boyi Lin, Jianwei Zheng
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2603.21978
Pdf URL: https://arxiv.org/pdf/2603.21978
Copy Paste: [[2603.21978]] GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design(https://arxiv.org/abs/2603.21978)
Keywords: generation
Abstract: Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windowing hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our proposal encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark that increases the sequence length ranging from 40 to 240 while preserving sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available at GitHub.

Title: Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Authors: SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21986
Pdf URL: https://arxiv.org/pdf/2603.21986
Copy Paste: [[2603.21986]] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model(https://arxiv.org/abs/2603.21986)
Keywords: super-resolution, generation, generative
Abstract: We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.

Title: STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection

Authors: Jianlin Chen, Gongyang Li, Zhijiang Zhang, Liang Chang, Dan Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.21999
Pdf URL: https://arxiv.org/pdf/2603.21999
Copy Paste: [[2603.21999]] STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection(https://arxiv.org/abs/2603.21999)
Keywords: generation
Abstract: Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows a two-stream encoder-decoder structure. At its core are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to select a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing the relevant local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at this https URL.
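Note: the complexity argument in this abstract (pixel-to-superpixel instead of pixel-to-pixel attention) can be illustrated with a minimal sketch. It is not STENet's implementation: the function names, the mean pooling used to build superpixel tokens, and the precomputed `labels` assignment are all assumptions made for illustration.

```python
import math

def superpixel_attention(pixels, labels, num_superpixels):
    """Let every pixel feature attend to superpixel tokens instead of all pixels.

    pixels: list of feature vectors, one per pixel.
    labels: superpixel index of each pixel (assumed to be given).
    """
    dim = len(pixels[0])
    # Build one token per superpixel by mean-pooling its pixel features (assumption).
    sums = [[0.0] * dim for _ in range(num_superpixels)]
    counts = [0] * num_superpixels
    for feat, lab in zip(pixels, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += feat[d]
    tokens = [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

    enhanced = []
    for query in pixels:
        # Scaled dot-product scores against the superpixel tokens only.
        scores = [sum(a * b for a, b in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        enhanced.append([
            sum(weights[j] / total * tokens[j][d] for j in range(num_superpixels))
            for d in range(dim)
        ])
    return enhanced

def attention_pairs(num_pixels, num_superpixels):
    """Number of query-key pairs: (pixel-to-pixel, pixel-to-superpixel)."""
    return num_pixels * num_pixels, num_pixels * num_superpixels
```

For a 64x64 feature map (4096 pixels) and 64 superpixels, `attention_pairs(4096, 64)` gives 16,777,216 pairs for pixel-to-pixel attention against 262,144 for pixel-to-superpixel attention, which is the source of the reduced cost.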

Title: Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models

Authors: Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun, Chao Zhou, Huaibo Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22027
Pdf URL: https://arxiv.org/pdf/2603.22027
Copy Paste: [[2603.22027]] Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models(https://arxiv.org/abs/2603.22027)
Keywords: restoration
Abstract: Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.

Title: DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

Authors: Binhong Tan, Zhaoxin Wang, Handing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22041
Pdf URL: https://arxiv.org/pdf/2603.22041
Copy Paste: [[2603.22041]] DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation(https://arxiv.org/abs/2603.22041)
Keywords: generation
Abstract: Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while maintaining reasonable generation quality on benign prompts.

Title: FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation

Authors: Wuyang Luo, Chengkai Tan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22054
Pdf URL: https://arxiv.org/pdf/2603.22054
Copy Paste: [[2603.22054]] FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation(https://arxiv.org/abs/2603.22054)
Keywords: generation
Abstract: Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.

Title: P-Flow: Prompting Visual Effects Generation

Authors: Rui Zhao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22091
Pdf URL: https://arxiv.org/pdf/2603.22091
Copy Paste: [[2603.22091]] P-Flow: Prompting Visual Effects Generation(https://arxiv.org/abs/2603.22091)
Keywords: generation
Abstract: Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at this https URL.

Title: FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario

Authors: Hang Dai, Hongwei Fan, Han Zhang, Duojin Wu, Jiyao Zhang, Hao Dong
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2603.22102
Pdf URL: https://arxiv.org/pdf/2603.22102
Copy Paste: [[2603.22102]] FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario(https://arxiv.org/abs/2603.22102)
Keywords: generation
Abstract: The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under the free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and robustly recovers the joint type and axis from the part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct the visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: this https URL
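Note: the abstract says joint type and axis are recovered from the relative motion of rigid parts. The sketch below shows the standard geometry behind that idea, not the paper's estimator: given a relative rigid motion (rotation matrix `R`, translation `t`) between two parts, it extracts the rotation axis and angle and applies a simple threshold rule. The function names and tolerances are assumptions.

```python
import math

def rotation_axis_angle(R):
    """Return (axis, angle) of a 3x3 rotation matrix given as nested lists.

    The axis is None when it is numerically undefined (angle near 0 or pi);
    a real system would handle the near-pi case separately.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    s = 2.0 * math.sin(theta)
    if abs(s) < 1e-8:
        return None, theta
    axis = (
        (R[2][1] - R[1][2]) / s,
        (R[0][2] - R[2][0]) / s,
        (R[1][0] - R[0][1]) / s,
    )
    return axis, theta

def classify_joint(R, t, angle_tol=1e-3, trans_tol=1e-3):
    """Hypothetical rule: rotation implies revolute, pure translation implies prismatic."""
    _, theta = rotation_axis_angle(R)
    if theta > angle_tol:
        return "revolute"
    if math.hypot(*t) > trans_tol:
        return "prismatic"
    return "static"
```

For example, a 90-degree rotation about the z-axis yields an axis close to (0, 0, 1) and is classified as revolute, while an identity rotation with a non-zero translation is classified as prismatic.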

Title: DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

Authors: Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22125
Pdf URL: https://arxiv.org/pdf/2603.22125
Copy Paste: [[2603.22125]] DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment(https://arxiv.org/abs/2603.22125)
Keywords: generation
Abstract: Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, however, already exhibit a structured, lower-dimensional latent space; thus, a simpler idea is to expand the latent dimensionality while preserving this structure. We therefore propose Detail-Aligned VAE (DA-VAE), which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first $C$ channels come directly from the pretrained VAE at a base resolution, while an additional $D$ channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables $1024 \times 1024$ image generation with Stable Diffusion 3.5 using only $32 \times 32$ tokens, $4\times$ fewer than the original model, within 5 H100-days. It further unlocks $2048 \times 2048$ generation with SD3.5, achieving a $6\times$ speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
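Note: a small worked sketch of the token arithmetic and the $C + D$ channel layout described in this abstract. The helper names are hypothetical, and the channel-axis concatenation is only an assumption about how such a layout could be stored; the paper's actual encoder is not reproduced here.

```python
def token_count(image_side, pixels_per_token_side):
    """Tokens needed for a square image when each token covers a square patch."""
    side = image_side // pixels_per_token_side
    return side * side

def build_latent(base_channels, detail_channels):
    """Stack C base channels and D detail channels on one shared H x W grid.

    Each channel is a nested list of shape [H][W]. The first C channels of the
    result are the base channels, left unchanged.
    """
    height, width = len(base_channels[0]), len(base_channels[0][0])
    for channel in detail_channels:
        if len(channel) != height or len(channel[0]) != width:
            raise ValueError("detail channels must share the base token grid")
    return list(base_channels) + list(detail_channels)

# A 1024 x 1024 image on a 32 x 32 token grid means each token covers 32 x 32 pixels.
compressed = token_count(1024, 32)   # 32 * 32 tokens
# "4x fewer than the original model" then implies a 64 x 64 grid for the original.
original = token_count(1024, 16)     # 64 * 64 tokens
```

Here `compressed` is 1024 and `original` is 4096, which matches the $4\times$ reduction stated in the abstract.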

Title: Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?

Authors: Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo, Antonio Peris, Carlos Kuchkovsky
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2603.22184
Pdf URL: https://arxiv.org/pdf/2603.22184
Copy Paste: [[2603.22184]] Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?(https://arxiv.org/abs/2603.22184)
Keywords: generation
Abstract: Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve. In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents, representing an improvement of more than 20 percentage points over zero-shot general-purpose performance and more than 35 percentage points over the parameter-specialized baseline. Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning, instead relying on inference-time augmentation, thereby enabling a more flexible and maintainable approach to LLM-assisted quantum software development.
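Note: pass@1 figures like those quoted above are commonly computed with the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n generated samples of which c pass the tests. The abstract does not say how its scores were computed, so the sketch below only illustrates the metric and the gaps between the quoted figures.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Gaps between the pass@1 figures quoted in the abstract, in percentage points.
fine_tuned, zero_shot_upper, agentic_best = 47, 65, 85
gap_vs_zero_shot = agentic_best - zero_shot_upper   # 20
gap_vs_fine_tuned = agentic_best - fine_tuned       # 38
```

With k = 1 the estimator reduces to c / n, so `pass_at_k(10, 3, 1)` is 0.3 up to floating-point rounding.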

Title: Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Authors: Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22187
Pdf URL: https://arxiv.org/pdf/2603.22187
Copy Paste: [[2603.22187]] Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement(https://arxiv.org/abs/2603.22187)
Keywords: generation, generative
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This capability is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at this https URL.

Title: PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation

Authors: Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Li Yi, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22193
Pdf URL: https://arxiv.org/pdf/2603.22193
Copy Paste: [[2603.22193]] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation(https://arxiv.org/abs/2603.22193)
Keywords: generation
Abstract: Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM, a Pose-Appearance-Motion Engine for controllable HOI video generation. We validate the engine in four ways: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to the 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
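Note: MPJPE (mean per-joint position error), reported above in millimetres, is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch for a single frame follows; the function name is an assumption, and the paper's exact evaluation protocol (alignment, averaging over frames) is not specified in the abstract.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between matching joints, in the units of the input."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Two joints that are off by 3 mm and 4 mm respectively.
example_error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])
```

In this toy case `example_error` is 3.5, the mean of the two per-joint errors.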

Title: Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs

Authors: Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.22206
Pdf URL: https://arxiv.org/pdf/2603.22206
Copy Paste: [[2603.22206]] Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs(https://arxiv.org/abs/2603.22206)
Keywords: generation
Abstract: Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2-2.4$\times$ and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM.
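Note: the three signals named in the abstract (per-model confidence, predicted remaining output length, and congestion from in-flight predicted tokens) could be combined as in the sketch below. This is a hypothetical scoring rule for illustration only, not Chimera's scheduler: the class, the linear trade-off, and `latency_weight` are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class ModelState:
    name: str
    confidence: float         # estimated chance this model handles the request well
    inflight_tokens: int      # predicted output tokens already queued on this model
    tokens_per_second: float  # decoding throughput of this model

def estimated_wait(state, predicted_remaining_tokens):
    """Rough seconds to drain the queue plus the workflow's remaining output."""
    return (state.inflight_tokens + predicted_remaining_tokens) / state.tokens_per_second

def pick_model(states, predicted_remaining_tokens, latency_weight=0.01):
    """Choose the model with the best confidence-minus-latency score (assumed rule)."""
    def score(state):
        wait = estimated_wait(state, predicted_remaining_tokens)
        return state.confidence - latency_weight * wait
    return max(states, key=score)
```

Under this rule a congested large model loses to an idle small one, while the same large model wins once its queue is empty, which is the kind of latency-performance trade-off a heterogeneous cluster allows.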

Title: Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Authors: Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22212
Pdf URL: https://arxiv.org/pdf/2603.22212
Copy Paste: [[2603.22212]] Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models(https://arxiv.org/abs/2603.22212)
Keywords: generation, generative
Abstract: Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.

Title: SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Authors: Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.22213
Pdf URL: https://arxiv.org/pdf/2603.22213
Copy Paste: [[2603.22213]] SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection(https://arxiv.org/abs/2603.22213)
Keywords: generation
Abstract: While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, its advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at this https URL.

Title: Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting

Authors: Qilin Wang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.22219
Pdf URL: https://arxiv.org/pdf/2603.22219
Copy Paste: [[2603.22219]] Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting(https://arxiv.org/abs/2603.22219)
Keywords: generative
Abstract: Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
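The titration idea in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's setup: the logistic map stands in for "a known chaotic system", the noise levels are arbitrary, and the function names are hypothetical. The sketch shows why a known noise variance makes evaluation exact: a forecaster that knows the clean state cannot do better, in expectation, than the Gaussian entropy 0.5 * log(2 * pi * e * sigma^2).

```python
import math
import random


def logistic_map(n, x0=0.2, r=3.9):
    """Clean trajectory of a chaotic system (logistic map, an illustrative choice)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def titrate(clean, sigma, rng):
    """Add Gaussian observation noise with a known standard deviation."""
    return [x + rng.gauss(0.0, sigma) for x in clean]


def gaussian_nll(y, mean, sigma):
    """Exact negative log-likelihood of y under N(mean, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mean) ** 2 / (2.0 * sigma ** 2)


def mean_nll(observed, means, sigma):
    """Average NLL of a Gaussian forecast with the given means and fixed sigma."""
    return sum(gaussian_nll(y, m, sigma) for y, m in zip(observed, means)) / len(observed)


def oracle_floor(sigma):
    """Expected NLL of the oracle forecast: the entropy of the injected noise."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


rng = random.Random(0)
clean = logistic_map(5000)
for sigma in (0.01, 0.05, 0.2):  # hypothetical titration levels
    noisy = titrate(clean, sigma, rng)
    print(sigma, mean_nll(noisy, clean, sigma), oracle_floor(sigma))
```

Any candidate forecaster can then be scored with `mean_nll` against the same noisy series and compared with `oracle_floor(sigma)` at each noise level.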

Title: SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Authors: Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22228
Pdf URL: https://arxiv.org/pdf/2603.22228
Copy Paste: [[2603.22228]] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation(https://arxiv.org/abs/2603.22228)
Keywords: generation
Abstract: Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

Title: Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Authors: Changxiao Cai, Gen Li
Subjects: cs.LG, cs.AI, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2603.22248
Pdf URL: https://arxiv.org/pdf/2603.22248
Copy Paste: [[2603.22248]] Confidence-Based Decoding is Provably Efficient for Diffusion Language Models(https://arxiv.org/abs/2603.22248)
Keywords: generation
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the decoding strategy, which determines the order and number of tokens generated at each iteration, critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.
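The entropy sum-based rule described in this abstract can be sketched for a single decoding iteration. This is a minimal illustration under stated assumptions, not the authors' algorithm: the abstract does not say in which order positions are visited or whether the position that crosses the threshold is kept, so the sketch assumes lowest-entropy-first ordering and keeps the crossing position. The names `entropy` and `select_positions` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(masked_probs, threshold):
    """Choose which masked positions to unmask in one iteration.

    masked_probs maps a position to the model's predicted distribution there.
    Positions are visited from lowest to highest entropy (an assumption), and
    selection stops once the running entropy sum exceeds `threshold`.
    The position that crosses the threshold is kept (also an assumption),
    so at least one token is unmasked per iteration.
    """
    ranked = sorted(masked_probs, key=lambda pos: entropy(masked_probs[pos]))
    chosen, total = [], 0.0
    for pos in ranked:
        chosen.append(pos)
        total += entropy(masked_probs[pos])
        if total > threshold:
            break
    return chosen


# Toy predictions over a two-symbol vocabulary at four masked positions.
preds = {
    0: [1.0, 0.0],    # fully confident
    1: [0.9, 0.1],
    2: [0.5, 0.5],    # maximally uncertain
    3: [0.99, 0.01],
}
print(select_positions(preds, threshold=0.3))
```

With confident predictions many positions fit under the threshold and are unmasked together, while uncertain predictions exhaust the budget quickly. This matches the abstract's point that the number of tokens per iteration adapts to the data's entropy.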

Title: GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning

Authors: Yixuan Luo, Feng Qiao, Zhexiao Xiong, Yanjing Li, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22270
Pdf URL: https://arxiv.org/pdf/2603.22270
Copy Paste: [[2603.22270]] GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning(https://arxiv.org/abs/2603.22270)
Keywords: generation, generative
Abstract: Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.

Title: DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution

Authors: Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22271
Pdf URL: https://arxiv.org/pdf/2603.22271
Copy Paste: [[2603.22271]] DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution(https://arxiv.org/abs/2603.22271)
Keywords: super-resolution, generation
Abstract: Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.

Title: Repurposing Geometric Foundation Models for Multi-view Diffusion

Authors: Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22275
Pdf URL: https://arxiv.org/pdf/2603.22275
Copy Paste: [[2603.22275]] Repurposing Geometric Foundation Models for Multi-view Diffusion(https://arxiv.org/abs/2603.22275)
Keywords: generation, generative
Abstract: While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.

Title: Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Authors: Alexandra Zelenin, Alexandra Zhuravlyova
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.22276
Pdf URL: https://arxiv.org/pdf/2603.22276
Copy Paste: [[2603.22276]] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels(https://arxiv.org/abs/2603.22276)
Keywords: generation
Abstract: Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.

Title: UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22282
Pdf URL: https://arxiv.org/pdf/2603.22282
Copy Paste: [[2603.22282]] UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation(https://arxiv.org/abs/2603.22282)
Keywords: generation
Abstract: We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem, where text supervision alone is too sparse to calibrate the newly introduced motion pathway, we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
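The abstract does not give the form of the Dual-Posterior KL Alignment loss. As a rough sketch only, assume both encoders output diagonal Gaussian posteriors, as is common for VAEs. A KL term pulling the motion-only posterior toward the vision-fused one could then use the closed-form KL divergence below. The direction of the KL, the diagonal-Gaussian form, and the name `kl_diag_gaussians` are all assumptions made here.

```python
import math


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, diag(exp(logvar_p))) || N(mu_q, diag(exp(logvar_q))) ).

    Closed form, summed over latent dimensions:
    0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q)^2) / var_q - 1).
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        var_p, var_q = math.exp(lp), math.exp(lq)
        total += 0.5 * (lq - lp + (var_p + (mp - mq) ** 2) / var_q - 1.0)
    return total


# Hypothetical posteriors for one motion clip with a 3-dimensional latent.
fused_mu, fused_logvar = [0.2, -0.1, 0.4], [-1.0, -0.5, -0.8]    # vision-fused encoder
motion_mu, motion_logvar = [0.0, 0.0, 0.3], [-0.2, -0.2, -0.2]   # motion-only encoder

# Assumed direction: the fused posterior is the target the motion-only one is pulled toward.
print(kl_diag_gaussians(fused_mu, fused_logvar, motion_mu, motion_logvar))
```

The divergence is zero only when the two posteriors coincide, so minimizing it with respect to the motion-only encoder's outputs moves that posterior toward the fused one, without needing images at inference.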

Title: End-to-End Training for Unified Tokenization and Latent Denoising

Authors: Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.22283
Pdf URL: https://arxiv.org/pdf/2603.22283
Copy Paste: [[2603.22283]] End-to-End Training for Unified Tokenization and Latent Denoising(https://arxiv.org/abs/2603.22283)
Keywords: generation, generative
Abstract: Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.