2026-03-25

Title: Founder effects shape the evolutionary dynamics of multimodality in open LLM families

Authors: Manuel Cebrian
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.22287
Pdf URL: https://arxiv.org/pdf/2603.22287
Copy Paste: [[2603.22287]] Founder effects shape the evolutionary dynamics of multimodality in open LLM families(https://arxiv.org/abs/2603.22287)
Keywords: generation
Abstract: Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (>1.8x10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024-2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from ~1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (~60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification. Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.
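The edge-level percentages quoted in this abstract are conditional proportions over recorded parent-to-child edges, taken in two directions: the share of each parent type among edges that end in a VLM, and the fraction of edges leaving a text-generation parent that end in a VLM. A minimal sketch of that bookkeeping follows. It assumes a hypothetical list of `(parent_type, child_type)` pairs, not the actual ModelBiome schema, and uses toy counts rather than the paper's data.

```python
from collections import Counter

def parent_share_for_child(edges, child_type):
    """Share of each parent type among edges that end in `child_type`."""
    parents = Counter(p for p, c in edges if c == child_type)
    total = sum(parents.values())
    return {p: n / total for p, n in parents.items()} if total else {}

def child_rate_for_parent(edges, parent_type, child_type):
    """Fraction of edges leaving `parent_type` that end in `child_type`."""
    from_parent = [c for p, c in edges if p == parent_type]
    if not from_parent:
        return 0.0
    return from_parent.count(child_type) / len(from_parent)

# Toy edge list (hypothetical, not drawn from the dataset).
edges = (
    [("text-generation", "text-generation")] * 97
    + [("text-generation", "vlm")] * 3
    + [("vlm", "vlm")] * 27
)

shares = parent_share_for_child(edges, "vlm")
rate = child_rate_for_parent(edges, "text-generation", "vlm")
```

In this toy list most VLM children come from VLM parents even though text-generation parents are far more numerous, which is the asymmetry the abstract describes.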

Title: Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks

Authors: Srideepika Jayaraman, Achille Fokoue, Dhaval Patel, Jayant Kalagnanam
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22294
Pdf URL: https://arxiv.org/pdf/2603.22294
Copy Paste: [[2603.22294]] Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks(https://arxiv.org/abs/2603.22294)
Keywords: generation
Abstract: Synthetic Data Generation (SDG), leveraging Large Language Models (LLMs), has recently been recognized and broadly adopted as an effective approach to improve the performance of smaller but more resource and compute efficient LLMs through fine-tuning. A key challenge in SDG is ensuring the quality and diversity of the generated data. In this paper, we analyze the diversity and distribution of generated data in the embedding space, and demonstrate a strong correlation between the density of examples within a specific neighborhood and the accuracy of predictions on examples drawn from that region. Building on this insight, we present a targeted pipeline for embedding-based sampling that enhances data diversity and consistently improves performance across several benchmarks.
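The link between neighborhood density and accuracy suggests looking for sparsely populated regions of the embedding space. The sketch below is one simple way to do that and is not the authors' pipeline: it scores each point by the inverse mean distance to its k nearest neighbours and returns the sparsest points as candidate seeds. The function names, the value of `k`, and the toy 2-D "embeddings" are all assumptions made for illustration.

```python
import math

def knn_density(points, k=2):
    """Inverse mean distance to the k nearest neighbours (higher = denser)."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_d = sum(dists[:k]) / k
        densities.append(1.0 / (mean_d + 1e-12))
    return densities

def sparsest_indices(points, n, k=2):
    """Indices of the n lowest-density points, candidate seeds for new samples."""
    dens = knn_density(points, k)
    return sorted(range(len(points)), key=lambda i: dens[i])[:n]

# Toy 2-D "embeddings": a tight cluster plus one isolated point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
seeds = sparsest_indices(embeddings, n=1)
```

The brute-force distance computation is quadratic in the number of points; a real pipeline would more likely use an approximate nearest-neighbour index.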

Title: Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization

Authors: Wenhao Zhao, Qiran Zou, Zhouhan Lin, Dianbo Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.22304
Pdf URL: https://arxiv.org/pdf/2603.22304
Copy Paste: [[2603.22304]] Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization(https://arxiv.org/abs/2603.22304)
Keywords: generative
Abstract: Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and diffusion synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the encoder has captured the underlying data manifold. We term this phenomenon Premature Discretization. To resolve this, we propose Progressive Quantization (ProVQ), which incorporates the dynamics of quantization hardness as a fundamental yet previously overlooked axis in VQ training. By treating quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, ProVQ effectively guides the codebook toward the well-expanded manifolds. Extensive experimental results demonstrate the broad effectiveness of ProVQ across diverse modalities. We report improved reconstruction and generative performance on the ImageNet-1K and ImageNet-100 benchmarks, highlighting ProVQ's boost for generative modeling. Furthermore, ProVQ proves highly effective for modeling complex biological sequences, establishing a new performance ceiling for protein structure tokenization on the StrutTokenBench leaderboard.
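The abstract does not say how quantization hardness is annealed, so the following is only one possible reading, given to make the idea concrete. It assumes a linear hardness schedule `alpha` that runs from 0 to 1 and blends each continuous latent with its nearest codebook vector. The schedule, the blending rule, and every function name are assumptions, not ProVQ's actual formulation.

```python
import math

def nearest_code(z, codebook):
    """Hard quantization: the codebook vector closest to z."""
    return min(codebook, key=lambda c: math.dist(z, c))

def hardness(step, total_steps):
    """Assumed linear schedule: 0 (fully continuous) to 1 (fully discrete)."""
    return min(1.0, max(0.0, step / total_steps))

def soft_quantize(z, codebook, alpha):
    """Blend the continuous latent with its nearest code by hardness alpha."""
    q = nearest_code(z, codebook)
    return tuple((1 - alpha) * zi + alpha * qi for zi, qi in zip(z, q))

codebook = [(0.0, 0.0), (1.0, 1.0)]
z = (0.8, 0.6)
early = soft_quantize(z, codebook, hardness(0, 100))    # alpha = 0: latent unchanged
late = soft_quantize(z, codebook, hardness(100, 100))   # alpha = 1: nearest code
```

At `alpha = 0` the encoder output passes through untouched, and at `alpha = 1` the output is the usual hard assignment, so the schedule interpolates between the two regimes the abstract contrasts.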

Title: Full waveform inversion method based on diffusion model

Authors: Caiyun Liu, Siyang Pei, Qingfeng Yu, Jie Xiong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.22307
Pdf URL: https://arxiv.org/pdf/2603.22307
Copy Paste: [[2603.22307]] Full waveform inversion method based on diffusion model(https://arxiv.org/abs/2603.22307)
Keywords: generative
Abstract: Seismic full-waveform inversion is a core technology for obtaining high-resolution subsurface model parameters. However, its highly nonlinear characteristics and strong dependence on the initial model often lead to the inversion process getting trapped in local minima. In recent years, generative diffusion models have provided a way to regularize full-waveform inversion by learning implicit prior distributions. However, existing methods mostly use unconditional diffusion processes, ignoring the inherent physical coupling relationship between velocity and density and other physical properties. This paper proposes a full-waveform inversion method based on conditional diffusion model regularization. By improving the backbone network structure of the diffusion model, two-dimensional density information is introduced as a conditional input into the U-Net network. Experimental results show that the full-waveform inversion method based on the conditional diffusion model significantly improves the resolution and structural fidelity of the inversion results, and exhibits stronger stability and robustness when dealing with complex situations. This method effectively utilizes density information to constrain the inversion and has good practical application value. Keywords: Deep learning; Diffusion model; Full waveform inversion.

Title: UniFluids: Unified Neural Operator Learning with Conditional Flow-matching

Authors: Haosen Li, Qi Meng, Jiahao Li, Rui Zhang, Ruihua Song, Liang Ma, Zhi-Ming Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22309
Pdf URL: https://arxiv.org/pdf/2603.22309
Copy Paste: [[2603.22309]] UniFluids: Unified Neural Operator Learning with Conditional Flow-matching(https://arxiv.org/abs/2603.22309)
Keywords: generation
Abstract: Partial differential equation (PDE) simulation holds extensive significance in scientific research. Currently, the integration of deep neural networks to learn solution operators of PDEs has shown great potential. In this paper, we present UniFluids, a conditional flow-matching framework that harnesses the scalability of the diffusion Transformer to unify learning of solution operators across diverse PDEs with varying dimensionality and physical variables. Unlike autoregressive PDE foundation models, UniFluids adopts flow-matching to achieve parallel sequence generation, making it the first such approach for unified operator learning. Specifically, the introduction of a unified four-dimensional spatiotemporal representation for the heterogeneous PDE datasets enables joint training and conditional encoding. Furthermore, we find the effective dimension of the PDE dataset is much lower than its patch dimension. We thus employ $x$-prediction in the flow-matching operator learning, which is verified to significantly improve prediction accuracy. We conduct a large-scale evaluation of UniFluids on several PDE datasets covering spatial dimensions 1D, 2D and 3D. Experimental results show that UniFluids achieves strong prediction accuracy and demonstrates good scalability and cross-scenario generalization capability. The code will be released later.

Title: ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography

Authors: Jing Xu, Weiqiang Wang, Cunjian Chen, Jun Liu, Qiuhong Ke
Subjects: cs.LG, cs.AI, cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2603.22316
Pdf URL: https://arxiv.org/pdf/2603.22316
Copy Paste: [[2603.22316]] ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography(https://arxiv.org/abs/2603.22316)
Keywords: generation
Abstract: Group dance generation from music requires synchronizing multiple dancers while maintaining spatial coordination, making it highly relevant to applications such as film production, gaming, and animation. Recent group dance generation models have achieved promising generation quality, but they remain difficult to deploy in interactive scenarios due to bidirectional attention dependencies. As the number of dancers and the sequence length increase, the attention computation required for aligning music conditions with motion sequences grows quadratically, leading to reduced efficiency and increased risk of motion collisions. Effectively modeling dense spatial-temporal interactions is therefore essential, yet existing methods often struggle to capture such complexity, resulting in limited scalability and unstable multi-dancer coordination. To address these challenges, we propose ST-GDance++, a scalable framework that decouples spatial and temporal dependencies to enable efficient and collision-aware group choreography generation. For spatial modeling, we introduce lightweight distance-aware graph convolutions to capture inter-dancer relationships while reducing computational overhead. For temporal modeling, we design a diffusion noise scheduling strategy together with an efficient temporal-aligned attention mask, enabling stream-based generation for long motion sequences and improving scalability in long-duration scenarios. Experiments on the AIOZ-GDance dataset show that ST-GDance++ achieves competitive generation quality with significantly reduced latency compared to existing methods.

Title: Sparsely-Supervised Data Assimilation via Physics-Informed Schrödinger Bridge

Authors: Dohyun Bu, Chanho Kim, Seokun Choi, Jong-Seok Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22319
Pdf URL: https://arxiv.org/pdf/2603.22319
Copy Paste: [[2603.22319]] Sparsely-Supervised Data Assimilation via Physics-Informed Schrödinger Bridge(https://arxiv.org/abs/2603.22319)
Keywords: generative
Abstract: Data assimilation (DA) for systems governed by partial differential equations (PDE) aims to reconstruct full spatiotemporal fields from sparse high-fidelity (HF) observations while respecting physical constraints. While full-grid low-fidelity (LF) simulations provide informative priors in multi-fidelity settings, recovering an HF field consistent with both sparse observations and the governing PDE typically requires per-instance test-time optimization, which becomes a major bottleneck in time-critical applications. To alleviate this, amortized reconstruction using generative models has recently been proposed; however, such approaches rely on full-field HF supervision during training, which is often impractical in real-world settings. From a more realistic perspective, we propose the Physics-Informed Conditional Schrödinger Bridge (PICSB), which transports an informative LF prior toward an observation-conditioned HF posterior without any additional inference-time guidance. To enable learning without HF endpoints, PICSB employs an iterative surrogate-endpoint refresh scheme, and directly incorporates PDE residuals into the training objective while enforcing observations via hard conditioning throughout sampling. Experiments on fluid PDE benchmarks demonstrate that PICSB enables extremely fast spatiotemporal field reconstruction while maintaining competitive accuracy under sparse HF supervision.

Title: MCLR: Improving Conditional Modeling in Visual Generative Models via Inter-Class Likelihood-Ratio Maximization and Establishing the Equivalence between Classifier-Free Guidance and Alignment Objectives

Authors: Xiang Li, Yixuan Jia, Xiao Li, Jeffrey A. Fessler, Rongrong Wang, Qing Qu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.22364
Pdf URL: https://arxiv.org/pdf/2603.22364
Copy Paste: [[2603.22364]] MCLR: Improving Conditional Modeling in Visual Generative Models via Inter-Class Likelihood-Ratio Maximization and Establishing the Equivalence between Classifier-Free Guidance and Alignment Objectives(https://arxiv.org/abs/2603.22364)
Keywords: generative
Abstract: Diffusion models have achieved state-of-the-art performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. From a theoretical perspective, diffusion models trained with standard denoising score matching (DSM) are expected to recover the target data distribution, raising the question of why inference-time guidance is necessary in practice. In this work, we ask whether the DSM training objective can be modified in a principled manner such that standard reverse-time sampling, without inference-time guidance, yields effects comparable to CFG. We identify insufficient inter-class separation as a key limitation of standard diffusion models. To address this, we propose MCLR, a principled alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Models fine-tuned with MCLR exhibit CFG-like improvements under standard sampling, achieving comparable qualitative and quantitative gains without requiring inference-time guidance. Beyond empirical benefits, we provide a theoretical result showing that the CFG-guided score is exactly the optimal solution to a weighted MCLR objective. This establishes a formal equivalence between classifier-free guidance and alignment-based objectives, offering a mechanistic interpretation of CFG.
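For context, the classifier-free guidance step that this abstract treats as the inference-time heuristic can be written in a few lines. The sketch below shows CFG itself, not MCLR. It assumes the common convention in which the guided prediction extrapolates from the unconditional output towards the conditional one by a scale `w`; other papers parameterize the scale differently. The vectors are toy values.

```python
def cfg_combine(cond, uncond, w):
    """Classifier-free guidance: extrapolate from uncond towards cond by scale w."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

cond = [1.0, 2.0]      # model output given the class or prompt
uncond = [0.5, 1.0]    # model output with the condition dropped
plain = cfg_combine(cond, uncond, 1.0)    # w = 1 recovers the conditional prediction
guided = cfg_combine(cond, uncond, 3.0)   # w > 1 pushes past it
```

Under this convention, CFG needs two forward passes per sampling step whenever `w` differs from 1, which is part of why a training-time objective with similar effects is attractive.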

Title: Three Creates All: You Only Sample 3 Steps

Authors: Yuren Cai, Guangyi Wang, Zongqing Li, Li Li, Zhihui Liu, Songzhi Su
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.22375
Pdf URL: https://arxiv.org/pdf/2603.22375
Copy Paste: [[2603.22375]] Three Creates All: You Only Sample 3 Steps(https://arxiv.org/abs/2603.22375)
Keywords: generation
Abstract: Diffusion models deliver high-fidelity generation but remain slow at inference time due to many sequential network evaluations. We find that standard timestep conditioning becomes a key bottleneck for few-step sampling. Motivated by layer-dependent denoising dynamics, we propose Multi-layer Time Embedding Optimization (MTEO), which freezes the pretrained diffusion backbone and distills a small set of step-wise, layer-wise time embeddings from reference trajectories. MTEO is plug-and-play with existing ODE solvers, adds no inference-time overhead, and trains only a tiny fraction of parameters. Extensive experiments across diverse datasets and backbones show state-of-the-art performance in few-step sampling and substantially narrow the gap between distillation-based and lightweight methods. Code will be available.

Title: OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction

Authors: Hamidreza Aftabi, Faye Yu, Brooke Switzer, Zachary Fishman, Eitan Prisman, Antony Hodgson, Cari Whyne, Sidney Fels, Michael Hardisty
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22421
Pdf URL: https://arxiv.org/pdf/2603.22421
Copy Paste: [[2603.22421]] OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction(https://arxiv.org/abs/2603.22421)
Keywords: generative
Abstract: Predicting long-term bone remodeling after mandibular reconstruction would be of great clinical benefit, yet standard generative models struggle to maintain trajectory-level consistency and anatomical fidelity over long horizons. We introduce OsteoFlow, a flow-based framework predicting Year-1 post-operative CT scans from Day-5 scans. Our core contribution is Lyapunov-guided trajectory distillation: Unlike one-step distillation, our method distills a continuous trajectory over transport time from a registration-derived stationary velocity field teacher. Combined with a resection-aware image loss, this enforces geometric correspondence without sacrificing generative capacity. Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state-of-the-art baselines, reducing mean absolute error in the surgical resection zone by ~20%. This highlights the promise of trajectory distillation for long-term prediction. Code is available on GitHub: OsteoFlow.

Title: MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Authors: Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22458
Pdf URL: https://arxiv.org/pdf/2603.22458
Copy Paste: [[2603.22458]] MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding(https://arxiv.org/abs/2603.22458)
Keywords: generation
Abstract: Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.

Title: Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing

Authors: Weitong Cai, Hang Zhang, Yukai Huang, Shitong Sun, Jiankang Deng, Songcen Xu, Jifei Song, Zhensong Zhang
Subjects: cs.CV, cs.AI, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2603.22466
Pdf URL: https://arxiv.org/pdf/2603.22466
Copy Paste: [[2603.22466]] Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing(https://arxiv.org/abs/2603.22466)
Keywords: generation
Abstract: Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.

Title: Tiny Inference-Time Scaling with Latent Verifiers

Authors: Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2603.22492
Pdf URL: https://arxiv.org/pdf/2603.22492
Copy Paste: [[2603.22492]] Tiny Inference-Time Scaling with Latent Verifiers(https://arxiv.org/abs/2603.22492)
Keywords: generation, generative
Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.
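Verifier-based inference-time scaling, as described in the first sentence of this abstract, amounts to best-of-N selection: draw several candidates and keep the one the verifier scores highest. The sketch below shows only that generic loop, with hypothetical stand-ins for the generator and the verifier. It does not model VHS or its use of hidden states.

```python
import random

def best_of_n(generate, verify, prompt, n, seed=0):
    """Draw n candidates and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))

# Hypothetical stand-ins: a "generator" that emits numbers and a "verifier"
# that prefers values close to a target encoded in the prompt.
def toy_generate(prompt, rng):
    return rng.uniform(0.0, 10.0)

def toy_verify(prompt, candidate):
    return -abs(candidate - float(prompt))

one = best_of_n(toy_generate, toy_verify, "5", n=1)
many = best_of_n(toy_generate, toy_verify, "5", n=16)
```

Total cost grows with `n` times the combined cost of one generation and one verification, which is why a cheaper verifier matters most when the candidate budget is small.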

Title: Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation

Authors: Delin An, Chaoli Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22509
Pdf URL: https://arxiv.org/pdf/2603.22509
Copy Paste: [[2603.22509]] Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation(https://arxiv.org/abs/2603.22509)
Keywords: generation
Abstract: Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at this https URL.

Title: Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Authors: Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.22529
Pdf URL: https://arxiv.org/pdf/2603.22529
Copy Paste: [[2603.22529]] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos(https://arxiv.org/abs/2603.22529)
Keywords: generation
Abstract: Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.

Title: UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images

Authors: Kaizhen Tan, Fan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22531
Pdf URL: https://arxiv.org/pdf/2603.22531
Copy Paste: [[2603.22531]] UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images(https://arxiv.org/abs/2603.22531)
Keywords: generation
Abstract: Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large-scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street-view image. The method combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane. On a ground-truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV-SideWidth, a prototype sidewalk-width dataset covering 527 OpenStreetMap street segments. The results indicate that street-view imagery can support scalable generation of candidate sidewalk-width attributes, while broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.
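The two accuracy figures quoted above (mean absolute error, and the share of estimates within 0.50 m of the reference) can be computed as in the minimal sketch below. The function name and the sample widths are hypothetical and are not taken from the paper; only the 0.50 m tolerance comes from the abstract.

```python
def width_error_summary(predicted_m, reference_m, tolerance_m=0.50):
    """Return (mean absolute error in metres, fraction of estimates within tolerance)."""
    if len(predicted_m) != len(reference_m) or not predicted_m:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(p - r) for p, r in zip(predicted_m, reference_m)]
    mae = sum(abs_errors) / len(abs_errors)
    within = sum(e <= tolerance_m for e in abs_errors) / len(abs_errors)
    return mae, within


# Hypothetical sidewalk widths in metres, for illustration only (not paper data).
predicted = [1.50, 2.10, 1.80, 3.00]
reference = [1.40, 2.00, 2.50, 3.10]
mae, within = width_error_summary(predicted, reference)
```

Reporting both numbers together is useful because a low mean error can hide a small set of large misses, which the within-tolerance fraction exposes.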

Title: Generalized multi-object classification and tracking with sparse feature resonator networks

Authors: Lazar Supic, Alec Mullen, E. Paxon Frady
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22539
Pdf URL: https://arxiv.org/pdf/2603.22539
Copy Paste: [[2603.22539]] Generalized multi-object classification and tracking with sparse feature resonator networks(https://arxiv.org/abs/2603.22539)
Keywords: generative
Abstract: In visual scene understanding tasks, it is essential to capture both invariant and equivariant structure. While neural networks are frequently trained to achieve invariance to transformations such as translation, this often comes at the cost of losing access to equivariant information - e.g., the precise location of an object. Moreover, invariance is not naturally guaranteed through supervised learning alone, and many architectures generalize poorly to input transformations not encountered during training. Here, we take an approach based on analysis-by-synthesis and factoring using resonator networks. A generative model describes the construction of simple scenes containing MNIST digits and their transformations, like color and position. The resonator network inverts the generative model, and provides both invariant and equivariant information about particular objects. Sparse features learned from training data act as a basis set to provide flexibility in representing variable shapes of objects, allowing the resonator network to handle previously unseen digit shapes from the test set. The modular structure provides a shape module which contains information about the object shape with translation factored out, allowing a simple classifier to operate on centered digits. The classification layer is trained solely on centered data, requiring much less training data, and the network as a whole can identify objects with arbitrary translations without data augmentation. The natural attention-like mechanism of the resonator network also allows for analysis of scenes with multiple objects, where the network dynamics selects and centers only one object at a time. Further, the specific position information of a particular object can be extracted from the translation module, and we show that the resonator can be designed to track multiple moving objects with precision of a few pixels.

Title: MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data

Authors: Xingzhi Sun, João Felipe Rocha, Brett Phelan, Dhananjay Bhaskar, Guillaume Huguet, Yanlei Zhang, D.S. Magruder, Alexander Tong, Ke Xu, Oluwadamilola Fasina, Mark Gerstein, Guy Wolf, Natalia Ivanova, Christine L. Chaffer, Smita Krishnaswamy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.22564
Pdf URL: https://arxiv.org/pdf/2603.22564
Copy Paste: [[2603.22564]] MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data(https://arxiv.org/abs/2603.22564)
Keywords: generation, generative
Abstract: Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data's intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories.

Title: TrajLoom: Dense Future Trajectory Generation from Video

Authors: Zewei Zhang, Jia Jun Cheng Xian, Kaiwen Liu, Ming Liang, Hang Chu, Jun Chen, Renjie Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22606
Pdf URL: https://arxiv.org/pdf/2603.22606
Copy Paste: [[2603.22606]] TrajLoom: Dense Future Trajectory Generation from Video(https://arxiv.org/abs/2603.22606)
Keywords: generation
Abstract: Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at this https URL.
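The abstract describes Grid-Anchor Offset Encoding only as representing each point as an offset from its pixel-center anchor. The sketch below is one possible reading, not the authors' implementation: it assumes the anchor is the center of the pixel containing the first point of a track, and that pixel (i, j) spans [i, i+1) x [j, j+1). All function names are hypothetical.

```python
import math


def pixel_center_anchor(x, y):
    """Center of the pixel containing (x, y); assumed anchor convention."""
    return math.floor(x) + 0.5, math.floor(y) + 0.5


def encode_track(track):
    """Encode a track [(x, y), ...] as (anchor, offsets relative to the anchor)."""
    ax, ay = pixel_center_anchor(*track[0])
    offsets = [(x - ax, y - ay) for x, y in track]
    return (ax, ay), offsets


def decode_track(anchor, offsets):
    """Invert encode_track by adding the anchor back to every offset."""
    ax, ay = anchor
    return [(ax + dx, ay + dy) for dx, dy in offsets]


track = [(10.2, 4.9), (11.0, 5.5), (12.4, 6.1)]
anchor, offsets = encode_track(track)
restored = decode_track(anchor, offsets)
```

Under this reading, the offsets carry motion relative to a fixed grid location, so a model consuming them does not have to learn absolute image coordinates.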

Title: Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

Authors: Fulvio Sanguigni, Davide Lobba, Bin Ren, Marcella Cornia, Nicu Sebe, Rita Cucchiara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22607
Pdf URL: https://arxiv.org/pdf/2603.22607
Copy Paste: [[2603.22607]] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off(https://arxiv.org/abs/2603.22607)
Keywords: generation
Abstract: Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction of the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.

Title: A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images

Authors: Heesup Yun, Isaac Kazuo Uyehara, Ioannis Droutsas, Earl Ranario, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22622
Pdf URL: https://arxiv.org/pdf/2603.22622
Copy Paste: [[2603.22622]] A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images(https://arxiv.org/abs/2603.22622)
Keywords: generation
Abstract: Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, for extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at field scale remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant's architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we propose a method that generates token sequences that encode a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing testing of the hypothesis that organ-level architectural parameters could be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training. Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. This led to the conclusion that such plant architecture model generation and parameter extraction were possible from synthetic images; thus, future work will extend the approach to real imagery data.

Title: Q-Tacit: Image Quality Assessment via Latent Visual Reasoning

Authors: Yuxuan Jiang, Yixuan Li, Hanwei Zhu, Siyue Teng, Fan Zhang, David Bull
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22641
Pdf URL: https://arxiv.org/pdf/2603.22641
Copy Paste: [[2603.22641]] Q-Tacit: Image Quality Assessment via Latent Visual Reasoning(https://arxiv.org/abs/2603.22641)
Keywords: quality assessment
Abstract: Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, "Is natural language the ideal space for quality reasoning?" and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.

Title: GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning

Authors: Jiayin Sun, Caixia Sun, Boyu Yang, Hailin Li, Xiao Chen, Yi Zhang, Errui Ding, Liang Li, Chao Deng, Junlan Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22687
Pdf URL: https://arxiv.org/pdf/2603.22687
Copy Paste: [[2603.22687]] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning(https://arxiv.org/abs/2603.22687)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, which constrains their capacity for geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16$\times$ larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on the GeoTikz-Instruct dataset, which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM (LLM), enhancing reasoning performance in geometric problem-solving. Datasets and code are publicly available at: this https URL.

Title: WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment

Authors: Tzu-Ti Wei, Chu-Yu Huang, Yu-Chee Tseng, Jen-Jee Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22690
Pdf URL: https://arxiv.org/pdf/2603.22690
Copy Paste: [[2603.22690]] WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment(https://arxiv.org/abs/2603.22690)
Keywords: generation
Abstract: Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language, and because of direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher's visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A prefix-tuned language model then generates action descriptions from CSI embeddings. We also introduce the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark for semantic captioning from Wi-Fi signals. Experimental results show that WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.

Title: TimeWeaver: Age-Consistent Reference-Based Face Restoration with Identity Preservation

Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22701
Pdf URL: https://arxiv.org/pdf/2603.22701
Copy Paste: [[2603.22701]] TimeWeaver: Age-Consistent Reference-Based Face Restoration with Identity Preservation(https://arxiv.org/abs/2603.22701)
Keywords: restoration
Abstract: Recent progress in face restoration has shifted from visual fidelity to identity fidelity, driving a transition from reference-free to reference-based paradigms that condition restoration on reference images of the same person. However, these methods assume the reference and degraded input are age-aligned. When only cross-age references are available, as in historical restoration or missing-person retrieval, they fail to maintain age fidelity. To address this limitation, we propose TimeWeaver, the first reference-based face restoration framework supporting cross-age references. Given arbitrary reference images and a target-age prompt, TimeWeaver produces restorations with both identity fidelity and age consistency. Specifically, we decouple identity and age conditioning across training and inference. During training, the model learns an age-robust identity representation by fusing a global identity embedding with age-suppressed facial tokens via a transformer-based ID-Fusion module. During inference, two training-free techniques, Age-Aware Gradient Guidance and Token-Targeted Attention Boost, steer sampling toward desired age semantics, enabling precise adherence to the target-age prompt. Extensive experiments show that TimeWeaver surpasses existing methods in visual quality, identity preservation, and age consistency.

Title: Behavioral Heterogeneity as Quantum-Inspired Representation

Authors: Mohammad Elayan, Wissam Kontar
Subjects: cs.LG, cs.MA, stat.ME
Abstract URL: https://arxiv.org/abs/2603.22729
Pdf URL: https://arxiv.org/pdf/2603.22729
Copy Paste: [[2603.22729]] Behavioral Heterogeneity as Quantum-Inspired Representation(https://arxiv.org/abs/2603.22729)
Keywords: generation
Abstract: Driver heterogeneity is often reduced to labels or discrete regimes, compressing what is inherently dynamic into static categories. We introduce a quantum-inspired representation that models each driver as an evolving latent state, presented as a density matrix with structured mathematical properties. Behavioral observations are embedded via non-linear Random Fourier Features, while state evolution blends temporal persistence of behavior with context-dependent profile activation. We evaluate our approach on empirical driving data, Third Generation Simulation Data (TGSIM), showing how driving profiles are extracted and analyzed.
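To make the ingredients named in the abstract concrete, the sketch below embeds an observation with Random Fourier Features, turns the embedding into a rank-one density matrix, and blends it with the previous state. It is an illustration under stated assumptions, not the authors' model: the Gaussian-kernel feature map, the `persistence` weight, and all function names are hypothetical, and the context-dependent profile activation mentioned in the abstract is omitted.

```python
import math
import random


def make_rff(input_dim, num_features, bandwidth=1.0, seed=0):
    """Random Fourier Features approximating a Gaussian kernel of the given bandwidth."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(input_dim)]
               for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def embed(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, phases)]

    return embed


def pure_state(z):
    """Rank-one density matrix z z^T / ||z||^2: symmetric, unit trace, PSD."""
    norm_sq = sum(v * v for v in z)
    return [[a * b / norm_sq for b in z] for a in z]


def blend(rho_prev, rho_obs, persistence):
    """Convex mix of two density matrices; unit trace and PSD are preserved."""
    return [[persistence * p + (1.0 - persistence) * o for p, o in zip(rp, ro)]
            for rp, ro in zip(rho_prev, rho_obs)]


# Hypothetical 2-D observations (for example speed and gap), illustration only.
embed = make_rff(input_dim=2, num_features=4)
rho = pure_state(embed([0.5, 12.0]))
rho = blend(rho, pure_state(embed([0.7, 11.0])), persistence=0.8)
trace = sum(rho[i][i] for i in range(4))
```

A convex combination is used for the update because it is the simplest rule that keeps the state a valid density matrix at every step.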

Title: From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery

Authors: Bijay Shakya, Catherine Hoier, Khandaker Mamun Ahmed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22768
Pdf URL: https://arxiv.org/pdf/2603.22768
Copy Paste: [[2603.22768]] From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery(https://arxiv.org/abs/2603.22768)
Keywords: restoration, super-resolution
Abstract: Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive post-disaster building damage assessment. First, we enhance pre- and post-disaster satellite imagery using a Video Restoration Transformer (VRT) to upscale images from 1024x1024 to 4096x4096 resolution, improving structural detail visibility. Next, a YOLOv11-based detector localizes buildings in pre-disaster imagery, and cropped building regions are analyzed using VLMs to semantically assess structural damage across four severity levels. To ensure robust evaluation in the absence of ground-truth captions, we employ CLIPScore for reference-free semantic alignment and introduce a multi-model VLM-as-a-Jury strategy to reduce individual model bias in safety-critical decision making. Experiments on subsets of the xBD dataset, including the Moore Tornado and Hurricane Matthew events, demonstrate that the proposed framework enhances the semantic interpretation of damaged buildings. In addition, our framework provides helpful recommendations to first responders for recovery based on damage analysis.
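The abstract does not say how the VLM-as-a-Jury votes are combined. A minimal sketch, assuming a plain majority vote over four xBD-style severity labels with ties resolved toward the more severe label, could look like this. The label names, the tie-break rule, and the function name are assumptions, not details from the paper.

```python
from collections import Counter

# Assumed label set, ordered from least to most severe (xBD-style names).
SEVERITY = ["no-damage", "minor-damage", "major-damage", "destroyed"]


def jury_verdict(votes):
    """Majority vote over per-model labels; ties go to the more severe label."""
    if not votes:
        raise ValueError("at least one vote is required")
    unknown = [v for v in votes if v not in SEVERITY]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    return max(tied, key=SEVERITY.index)


# Hypothetical outputs from three different VLMs for one cropped building.
verdict = jury_verdict(["minor-damage", "major-damage", "major-damage"])
```

Breaking ties toward the more severe label is a conservative choice for a safety-critical setting; a different rule, such as abstaining on ties, would be equally easy to plug in.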

Title: Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models

Authors: Wenyue Chen, Wenjue Chen, Peng Li, Qinghe Wang, Xu Jia, Heliang Zheng, Rongfei Jia, Yuan Liu, Ronggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22782
Pdf URL: https://arxiv.org/pdf/2603.22782
Copy Paste: [[2603.22782]] Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models(https://arxiv.org/abs/2603.22782)
Keywords: generation, generative
Abstract: Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, and they may fail to align with user intentions or may produce implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into 3D generative processes via latent hidden-state injection, enabling language-controllable generation of the back-view for 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance. The diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we successfully bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process, demonstrating a promising direction for future 3D generation models.

Title: Caterpillar of Thoughts: The Optimal Test-Time Algorithm for Large Language Models

Authors: Amir Azarmehr, Soheil Behnezhad, Alma Ghafari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.22784
Pdf URL: https://arxiv.org/pdf/2603.22784
Copy Paste: [[2603.22784]] Caterpillar of Thoughts: The Optimal Test-Time Algorithm for Large Language Models(https://arxiv.org/abs/2603.22784)
Keywords: generation
Abstract: Large language models (LLMs) can often produce substantially better outputs when allowed to use additional test-time computation, such as sampling, chain of thought, backtracking, or revising partial solutions. Despite the growing empirical success of such techniques, there is limited theoretical understanding of how inference time computation should be structured, or what constitutes an optimal use of a fixed computation budget. We model test-time computation as an algorithm interacting with a Markov chain: at any point, the algorithm may resume generation from any previously observed state. That is, unlike standard Markov chains where the states are drawn passively, we allow the algorithm to backtrack to any previously observed state of the Markov chain at any time. Many of the existing test-time algorithms, such as Chain-of-Thought (CoT) (Wei et al., 2023), Tree-of-Thoughts (ToT) (Yao et al., 2023), or Best-of-$k$ (Brown et al., 2024) could be seen as specific algorithms in this model. We prove that while backtracking can reduce the number of generations exponentially, a very limited form of backtracking is theoretically sufficient. Namely, we show that the optimal algorithm always generates a caterpillar tree. That is, if we remove the leaves of the state tree generated by the optimal algorithm, we obtain a path. Motivated by our characterization of the optimal algorithm, we present Caterpillar of Thoughts (CaT), a new test-time computation algorithm, reducing the number of token/state generations. Our empirical evaluation shows that CaT, compared to ToT, achieves a better success rate while also reducing the number of token generations.

Title: It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal

Authors: Lishen Qu, Shihao Zhou, Jie Liang, Hui Zeng, Lei Zhang, Jufeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22794
Pdf URL: https://arxiv.org/pdf/2603.22794
Copy Paste: [[2603.22794]] It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal(https://arxiv.org/abs/2603.22794)
Keywords: restoration
Abstract: Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network's ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available at this https URL.

Title: URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection

Authors: Wei Luo, Peng Xing, Yunkang Cao, Haiming Yao, Weiming Shen, Zechao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22840
Pdf URL: https://arxiv.org/pdf/2603.22840
Copy Paste: [[2603.22840]] URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection(https://arxiv.org/abs/2603.22840)
Keywords: restoration
Abstract: Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.

Title: A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection

Authors: Wei Luo, Haiming Yao, Zhenfeng Qiang, Xiaotian Zhang, Weihang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22861
Pdf URL: https://arxiv.org/pdf/2603.22861
Copy Paste: [[2603.22861]] A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection(https://arxiv.org/abs/2603.22861)
Keywords: restoration
Abstract: Unsupervised anomaly detection is vital in industrial fields, with reconstruction-based methods favored for their simplicity and effectiveness. However, reconstruction methods often encounter an identical shortcut issue, where both normal and anomalous regions can be well reconstructed and fail to identify outliers. The severity of this problem increases with the complexity of the normal data distribution. Consequently, existing methods may exhibit excellent detection performance in a specific scenario, but their performance sharply declines when transferred to another scenario. This paper focuses on establishing a universal model applicable to anomaly detection tasks across different settings, termed as universal anomaly detection. In this work, we introduce a novel, straightforward yet efficient framework for universal anomaly detection: Feature Shuffling and Restoration (FSR), which can alleviate the identical shortcut issue across different settings. First and foremost, FSR employs multi-scale features with rich semantic information as reconstruction targets, rather than raw image pixels. Subsequently, these multi-scale features are partitioned into non-overlapping feature blocks, which are randomly shuffled and then restored to their original state using a restoration network. This simple paradigm encourages the model to focus more on global contextual information. Additionally, we introduce a novel concept, the shuffling rate, to regulate the complexity of the FSR task, thereby alleviating the identical shortcut across different settings. Furthermore, we provide theoretical explanations for the effectiveness of FSR framework from two perspectives: network structure and mutual information. Extensive experimental results validate the superiority and efficiency of the FSR framework across different settings. Code is available at this https URL.
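
The block partition and shuffle step described in this abstract can be illustrated on a toy 2D grid. This is a sketch under stated assumptions, not the authors' implementation: the paper shuffles multi-scale features from a pretrained network and trains a restoration network to undo the shuffle, whereas the code below only permutes tiles of a nested list. The name `shuffle_blocks`, the `seed` argument, and the reading of the shuffling rate as "fraction of blocks selected for permutation" are all assumptions.

```python
import random

def shuffle_blocks(grid, block, rate, seed=0):
    """Split a 2D grid into non-overlapping block x block tiles and permute
    a fraction `rate` of them (hypothetical reading of the shuffling rate)."""
    h, w = len(grid), len(grid[0])
    assert h % block == 0 and w % block == 0, "grid must tile evenly"
    # Top-left corner of every tile, in row-major order.
    corners = [(r, c) for r in range(0, h, block) for c in range(0, w, block)]
    rng = random.Random(seed)
    k = round(rate * len(corners))
    chosen = rng.sample(corners, k)   # tiles selected for shuffling
    targets = chosen[:]
    rng.shuffle(targets)              # destination of each selected tile
    out = [row[:] for row in grid]
    for (sr, sc), (tr, tc) in zip(chosen, targets):
        for dr in range(block):
            for dc in range(block):
                out[tr + dr][tc + dc] = grid[sr + dr][sc + dc]
    return out
```

Because the destinations are a permutation of the selected tiles, the output always contains exactly the same values as the input; a random permutation may leave some selected tiles in place, so `rate` is an upper bound on the fraction that actually moves.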

Title: Designing to Forget: Deep Semi-parametric Models for Unlearning

Authors: Amber Yijia Zheng, Yu-Shan Tai, Raymond A. Yeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22870
Pdf URL: https://arxiv.org/pdf/2603.22870
Copy Paste: [[2603.22870]] Designing to Forget: Deep Semi-parametric Models for Unlearning(https://arxiv.org/abs/2603.22870)
Keywords: generation
Abstract: Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by $11\%$ and achieve over $10\times$ faster unlearning compared to existing approaches on parametric models. The code is available at this https URL.

Title: Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion

Authors: Shuangwu Qian, Xiaochan Yuan, Pengfei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22946
Pdf URL: https://arxiv.org/pdf/2603.22946
Copy Paste: [[2603.22946]] Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion(https://arxiv.org/abs/2603.22946)
Keywords: generation
Abstract: Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes PVGF-DPC (Prompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels -- such as deity, ritual pattern, or hell ghost -- and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image. We construct a dedicated Dongba painting captioning dataset comprising 9,408 augmented images with culturally grounded annotations spanning seven thematic categories.

Title: Few-Shot Generative Model Adaption via Identity Injection and Preservation

Authors: Yeqi He, Liang Li, Jiehua Zhang, Yaoqi Sun, Xichun Sheng, Zhidong Zhao, Chenggang Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22965
Pdf URL: https://arxiv.org/pdf/2603.22965
Copy Paste: [[2603.22965]] Few-Shot Generative Model Adaption via Identity Injection and Preservation(https://arxiv.org/abs/2603.22965)
Keywords: generative
Abstract: Training generative models with limited data presents severe challenges of mode collapse. A common approach is to adapt a large pretrained generative model upon a target domain with very few samples (fewer than 10), known as few-shot generative model adaptation. However, existing methods often suffer from forgetting source domain identity knowledge during adaptation, which degrades the quality of generated images in the target domain. To address this, we propose Identity Injection and Preservation (I$^2$P), which leverages identity injection and consistency alignment to preserve the source identity knowledge. Specifically, we first introduce an identity injection module that integrates source domain identity knowledge into the target domain's latent space, ensuring the generated images retain key identity knowledge of the source domain. Second, we design an identity substitution module, which includes a style-content decoupler and a reconstruction modulator, to further enhance source domain identity preservation. We enforce identity consistency constraints by aligning features from identity substitution, thereby preserving identity knowledge. Both quantitative and qualitative experiments show that our method achieves substantial improvements over state-of-the-art methods on multiple public datasets and 5 metrics.

Title: WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

Authors: Manuel-Andreas Schneider, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22972
Pdf URL: https://arxiv.org/pdf/2603.22972
Copy Paste: [[2603.22972]] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion(https://arxiv.org/abs/2603.22972)
Keywords: generation
Abstract: Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and -video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment's geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.

Title: VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought

Authors: Xuanyu Zhang, Weiqi Li, Qunliang Xing, Jingfen Xie, Bin Chen, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.22998
Pdf URL: https://arxiv.org/pdf/2603.22998
Copy Paste: [[2603.22998]] VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought(https://arxiv.org/abs/2603.22998)
Keywords: restoration, generation
Abstract: Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multiple operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.

Title: Zero-Shot Personalization of Objects via Textual Inversion

Authors: Aniket Roy, Maitreya Suin, Rama Chellappa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23010
Pdf URL: https://arxiv.org/pdf/2603.23010
Copy Paste: [[2603.23010]] Zero-Shot Personalization of Objects via Textual Inversion(https://arxiv.org/abs/2603.23010)
Keywords: generation
Abstract: Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.

Title: A Sobering Look at Tabular Data Generation via Probabilistic Circuits

Authors: Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23016
Pdf URL: https://arxiv.org/pdf/2603.23016
Copy Paste: [[2603.23016]] A Sobering Look at Tabular Data Generation via Probabilistic Circuits(https://arxiv.org/abs/2603.23016)
Keywords: generation, generative
Abstract: Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline -- hierarchical mixture models in the form of deep probabilistic circuits (PCs) -- which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at this https URL.

Title: Generative Event Pretraining with Foundation Model Alignment

Authors: Jianwen Cao, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.23032
Pdf URL: https://arxiv.org/pdf/2603.23032
Copy Paste: [[2603.23032]] Generative Event Pretraining with Foundation Model Alignment(https://arxiv.org/abs/2603.23032)
Keywords: generative
Abstract: Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.

Title: HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling

Authors: António Cardoso, Pedro Sousa, Tania Pereira, Hélder P. Oliveira
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23041
Pdf URL: https://arxiv.org/pdf/2603.23041
Copy Paste: [[2603.23041]] HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling(https://arxiv.org/abs/2603.23041)
Keywords: generative
Abstract: Currently, a central challenge and bottleneck in the deployment and validation of computer-aided diagnosis (CAD) models within the field of medical imaging is data scarcity. For lung cancer, one of the most prevalent types worldwide, limited datasets can delay diagnosis and have an impact on patient outcome. Generative AI offers a promising solution for this issue, but dealing with the complex distribution of full Hounsfield Unit (HU) range lung CT scans is challenging and remains as a highly computationally demanding task. This paper introduces a novel decomposition strategy that synthesizes CT images one HU interval at a time, rather than modelling the entire HU domain at once. This framework focuses on training generative architectures on individual tissue-focused HU windows, then merges their output into a full-range scan via a learned reconstruction network that effectively reverses the HU-windowing process. We further propose multi-head and multi-decoder models to better capture textures while preserving anatomical consistency, with a multi-head VQVAE achieving the best performance for the generative task. Quantitative evaluation shows this approach significantly outperforms conventional 2D full-range baselines, achieving a 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. The best performance is achieved by a multi-head VQVAE variant, demonstrating that it is possible to enhance visual fidelity and variability while also reducing model complexity and computational cost. This work establishes a new paradigm for structure-aware medical image synthesis, aligning generative modelling with clinical interpretation.
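
HU windowing, the operation this abstract decomposes and later reverses, amounts to clipping intensities to an interval and rescaling. The sketch below shows that arithmetic only. The function names and the interval boundaries in the usage note are illustrative assumptions, and the exact arithmetic merge shown here stands in for the learned reconstruction network that the paper actually uses.

```python
def hu_window(values, lo, hi):
    """Clip HU values to [lo, hi] and rescale them to [0, 1]."""
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]

def split_windows(values, bounds):
    """Produce one windowed channel per consecutive pair in `bounds`."""
    return [hu_window(values, lo, hi) for lo, hi in zip(bounds, bounds[1:])]

def merge_windows(channels, bounds):
    """Invert `split_windows` for contiguous windows.

    Each channel contributes the part of the HU value that falls inside its
    own interval, so summing the contributions recovers the clipped value.
    """
    widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    n = len(channels[0])
    return [bounds[0] + sum(ch[i] * w for ch, w in zip(channels, widths))
            for i in range(n)]
```

For example, with the hypothetical boundaries `[-1000, -400, 100, 1000]`, `split_windows` yields three channels, and `merge_windows` recovers any input value that lies inside the overall range, up to floating-point error. Values outside the range are clipped and cannot be recovered.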

Title: MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding

Authors: Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan, Muhammad Bilal, Moshira Ali Abdalla, Mohammed Bennamoun, Sajid Javed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23067
Pdf URL: https://arxiv.org/pdf/2603.23067
Copy Paste: [[2603.23067]] MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding(https://arxiv.org/abs/2603.23067)
Keywords: generation
Abstract: Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales, cell as word, patch as phrase, region as sentence, and WSI as paragraph to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per-patch using a lightweight Cell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available on GitHub at this https URL.

Title: Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

Authors: Orhun Buğra Baran, Melih Kandemir, Ramazan Gokberk Cinbis
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.23086
Pdf URL: https://arxiv.org/pdf/2603.23086
Copy Paste: [[2603.23086]] Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards(https://arxiv.org/abs/2603.23086)
Keywords: generation
Abstract: Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, and bypass its 2x inference cost.
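
The "group relative" part of GRPO mentioned in this abstract is commonly written as a per-group standardization of rewards: each sample's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below shows only that step. The function name and the `eps` constant are assumptions, and the rewards are placeholders; the paper's actual reward (a mix of LOO-FID, CLIP and HPSv2 terms) is not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt.

    `eps` (an assumed small constant) avoids division by zero when every
    sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples that score above their group's average get a positive advantage and are reinforced; samples below it are pushed down. A group with identical rewards yields all-zero advantages and therefore no policy update from that group.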
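
The abstract above names two ingredients: group-relative advantages (GRPO) and a leave-one-out, distribution-level reward built from running feature moments. The following is a minimal sketch of how they could fit together, not the authors' implementation. It assumes a diagonal-covariance Fréchet distance (real FID uses full covariances of Inception features) and a hypothetical leave-one-out definition: the reward of a sample is how much the distance to the reference grows when that sample is removed. All function names are illustrative.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def ema_update(old, new, decay=0.99):
    """Exponential moving average of a moment vector (assumed update rule)."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]


def moments(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dims = range(len(features[0]))
    mean = [sum(f[d] for f in features) / n for d in dims]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in dims]
    return mean, var


def diag_frechet(mean_a, var_a, mean_b, var_b):
    """Fréchet distance between Gaussians with diagonal covariances (simplification)."""
    return sum(
        (ma - mb) ** 2 + va + vb - 2.0 * math.sqrt(va * vb)
        for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b)
    )


def loo_rewards(features, ref_mean, ref_var):
    """Hypothetical leave-one-out reward: distance without sample i minus distance with it.

    A positive value means the batch matches the reference better with sample i included.
    """
    full = diag_frechet(*moments(features), ref_mean, ref_var)
    rewards = []
    for i in range(len(features)):
        rest = features[:i] + features[i + 1:]
        rewards.append(diag_frechet(*moments(rest), ref_mean, ref_var) - full)
    return rewards
```

In such a setup the per-sample training signal would be `group_relative_advantages` applied to a weighted sum of instance-level scores and `loo_rewards`; the weighting and the entropy regularizer are left out because the abstract does not specify them.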

Title: InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

Authors: Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li, Haojie Liu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23132
Pdf URL: https://arxiv.org/pdf/2603.23132
Copy Paste: [[2603.23132]] InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance(https://arxiv.org/abs/2603.23132)
Keywords: generation
Abstract: Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with novelly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: this https URL.

Title: DAK-UCB: Diversity-Aware Prompt Routing for LLMs and Generative Models

Authors: Donya Jafari, Farzan Farnia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.23140
Pdf URL: https://arxiv.org/pdf/2603.23140
Copy Paste: [[2603.23140]] DAK-UCB: Diversity-Aware Prompt Routing for LLMs and Generative Models(https://arxiv.org/abs/2603.23140)
Keywords: generation, generative
Abstract: The expansion of generative AI and LLM services underscores the growing need for adaptive mechanisms to select an appropriate available model to respond to a user's prompts. Recent works have proposed offline and online learning formulations to identify the optimal generative AI model for an input prompt, based solely on maximizing prompt-based fidelity evaluation scores, e.g., CLIP-Score in text-to-image generation. However, such fidelity-based selection methods overlook the diversity of generated outputs, and hence, they can fail to address potential diversity shortcomings in the generated responses. In this paper, we introduce the Diversity-Aware Kernelized Upper Confidence Bound (DAK-UCB) method as a contextual bandit algorithm for the online selection of generative models with diversity considerations. The proposed DAK-UCB method incorporates both fidelity and diversity-related metrics into the selection process. We design this framework based on prompt-aware diversity score functions that decompose to a two-sample-based expectation over prompt-output pairs in the previous generation rounds. Specifically, we illustrate the application of our framework using joint kernel distance and kernel entropy measures. Our experimental results demonstrate the effectiveness of DAK-UCB in promoting diversity-aware model selection while maintaining fidelity in the generations for a sequence of prompts. The code is available at this https URL.
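
To make the routing idea concrete, here is a deliberately simplified sketch. It swaps the paper's kernelized, contextual bandit for plain UCB1 over a fixed set of models and folds diversity in as a weighted additive term. The class name, the `diversity_weight` parameter, and the additive reward are assumptions for illustration, not the DAK-UCB algorithm itself.

```python
import math


class UCBRouter:
    """Plain UCB1 over generative models with a fidelity + diversity reward (illustrative)."""

    def __init__(self, n_models, diversity_weight=0.5, explore=1.0):
        self.counts = [0] * n_models
        self.sums = [0.0] * n_models
        self.diversity_weight = diversity_weight
        self.explore = explore

    def select(self):
        """Return the index of the model to query for the next prompt."""
        # Try every model once before using confidence bounds.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        t = sum(self.counts)
        scores = [
            s / c + self.explore * math.sqrt(2.0 * math.log(t) / c)
            for s, c in zip(self.sums, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, model, fidelity, diversity):
        """Record the observed fidelity and diversity scores for the chosen model."""
        self.counts[model] += 1
        self.sums[model] += fidelity + self.diversity_weight * diversity
```

With `diversity_weight=0` this reduces to fidelity-only selection, which is the baseline behaviour the abstract argues against.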

Title: VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution

Authors: August Leander Høeg, Sophia Wiinberg Bardenfleth, Hans Martin Kjer, Tim Bjørn Dyrby, Vedrana Andersen Dahl, Anders Bjorholm Dahl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23153
Pdf URL: https://arxiv.org/pdf/2603.23153
Copy Paste: [[2603.23153]] VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution(https://arxiv.org/abs/2603.23153)
Keywords: super-resolution
Abstract: Recent advances in volumetric super-resolution (SR) have demonstrated strong performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. In this work, we show that much of this performance stems from training on downsampled data rather than real low-resolution scans. This reliance on downsampling is partly driven by the scarcity of paired high- and low-resolution 3D datasets. To address this, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: SR models trained on downsampled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downsampled data to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated - when applied to real data, they do not recover structures lost in low-resolution scans and instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available through: this https URL

Title: GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field

Authors: Jingtao Zhou, Xuan Gao, Dongyu Liu, Junhui Hou, Yudong Guo, Juyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23168
Pdf URL: https://arxiv.org/pdf/2603.23168
Copy Paste: [[2603.23168]] GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field(https://arxiv.org/abs/2603.23168)
Keywords: generative
Abstract: We present GSwap, a novel consistent and realistic video head-swapping system empowered by dynamic neural Gaussian portrait priors, which significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.

Title: Gimbal360: Differentiable Auto-Leveling for Canonicalized $360^\circ$ Panoramic Image Completion

Authors: Yuqin Lu, Haofeng Liu, Yang Zhou, Jun Liang, Shengfeng He, Jing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23179
Pdf URL: https://arxiv.org/pdf/2603.23179
Copy Paste: [[2603.23179]] Gimbal360: Differentiable Auto-Leveling for Canonicalized $360^\circ$ Panoramic Image Completion(https://arxiv.org/abs/2603.23179)
Keywords: generation, generative
Abstract: Diffusion models excel at 2D outpainting, but extending them to $360^\circ$ panoramic completion from unposed perspective images is challenging due to the geometric and topological mismatch between perspective projections and spherical panoramas. We present Gimbal360, a principled framework that explicitly bridges perspective observations and spherical panoramas. We introduce a Canonical Viewing Space that regularizes projective geometry and provides a consistent intermediate representation between the two domains. To anchor in-the-wild inputs to this space, we propose a Differentiable Auto-Leveling module that stabilizes feature orientation without requiring camera parameters at inference. Panoramic generation also introduces a topological challenge. Standard generative architectures assume a bounded Euclidean image plane, while Equirectangular Projection (ERP) panoramas exhibit intrinsic $S^1$ periodicity. Euclidean operations therefore break boundary continuity. We address this mismatch by enforcing topological equivariance in the latent space to preserve seamless periodic structure. To support this formulation, we introduce Horizon360, a curated large-scale dataset of gravity-aligned panoramic environments. Extensive experiments show that explicitly standardizing geometric and topological priors enables Gimbal360 to achieve state-of-the-art performance in structurally consistent $360^\circ$ scene completion.
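
The claim that Euclidean operations break the $S^1$ periodicity of ERP panoramas can be illustrated in one dimension. The sketch below treats a single image row as a ring of longitudes and compares a wrap-around convolution with a zero-padded one. It is a toy illustration of the boundary issue, not the paper's latent-space mechanism, and the function names are hypothetical.

```python
def roll(row, shift):
    """Rotate a row of longitude samples to the right by `shift` positions."""
    shift %= len(row)
    return row[-shift:] + row[:-shift] if shift else list(row)


def circular_conv1d(row, kernel):
    """Convolution with wrap-around indexing, so the left and right edges are neighbours."""
    n, k = len(row), len(kernel)
    half = k // 2
    return [
        sum(kernel[j] * row[(i + j - half) % n] for j in range(k))
        for i in range(n)
    ]


def zero_pad_conv1d(row, kernel):
    """Same kernel with zero padding, i.e. the row is treated as a bounded interval."""
    n, k = len(row), len(kernel)
    half = k // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(k):
            idx = i + j - half
            if 0 <= idx < n:
                acc += kernel[j] * row[idx]
        out.append(acc)
    return out
```

Rotating the panorama about the vertical axis corresponds to `roll`. The circular version commutes with that rotation, while the zero-padded version does not, which is where a visible seam comes from.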

Title: GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models

Authors: Zekai Gu, Shuoxuan Feng, Yansong Wang, Hanzhuo Huang, Zhongshuo Du, Chengfeng Zhao, Chengwei Ren, Peng Wang, Yuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23246
Pdf URL: https://arxiv.org/pdf/2603.23246
Copy Paste: [[2603.23246]] GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models(https://arxiv.org/abs/2603.23246)
Keywords: generative
Abstract: Reconstructing a renderable 3D model from images is a useful but challenging task. Recent feedforward 3D reconstruction methods have demonstrated remarkable success in efficiently recovering geometry, but still cannot accurately model the complex appearances of these 3D reconstructed models. Recent diffusion-based generative models can synthesize realistic images or videos of an object using reference images without explicitly modeling its appearance, which provides a promising direction for object rendering, but lacks accurate control over the viewpoints. In this paper, we propose GO-Renderer, a unified framework integrating the reconstructed 3D proxies to guide the video generative models to achieve high-quality object rendering on arbitrary viewpoints under arbitrary lighting conditions. Our method not only enjoys the accurate viewpoint control using the reconstructed 3D proxy but also enables high-quality rendering in different lighting environments using diffusion generative models without explicitly modeling complex materials and lighting. Extensive experiments demonstrate that GO-Renderer achieves state-of-the-art performance across the object rendering tasks, including synthesizing images on new viewpoints, rendering the objects in a novel lighting environment, and inserting an object into an existing video.

Title: A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

Authors: Ruisong Zhou, Haijun Zou, Li Zhou, Chumin Sun, Zaiwen Wen
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2603.23249
Pdf URL: https://arxiv.org/pdf/2603.23249
Copy Paste: [[2603.23249]] A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling(https://arxiv.org/abs/2603.23249)
Keywords: generation
Abstract: Efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments is challenging due to resource capacities and dependencies. In practice, the need for adaptability across environments with varying resource pools and task types, alongside rapid schedule generation, complicates these challenges. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task-pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task-pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task-pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.
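
For readers unfamiliar with the "list-scheduling maps" the abstract refers to, the sketch below shows the classical baseline: given fixed task priorities (for example, scores from one network forward pass), repeatedly take the highest-priority ready task and place it on the pool that finishes it earliest. This is the standard heuristic, not WeCAN's skip-extended map. The data layout (`durations[task][pool]`, `deps[task]`) is an assumption, and an incompatible task-pool pair can be encoded as `float("inf")`.

```python
def list_schedule(durations, deps, priority):
    """Greedy list scheduling on heterogeneous pools.

    durations[t][p]: run time of task t on pool p.
    deps[t]: set of tasks that must finish before t starts.
    priority[t]: higher values are scheduled first among ready tasks.
    Returns (finish_times, makespan).
    """
    n_pools = len(durations[0])
    pool_free = [0.0] * n_pools          # time at which each pool becomes idle
    finish = {}                          # task -> finish time
    remaining = set(range(len(durations)))
    while remaining:
        ready = [t for t in remaining if all(d in finish for d in deps[t])]
        if not ready:
            raise ValueError("dependency cycle detected")
        task = max(ready, key=lambda t: priority[t])
        earliest = max((finish[d] for d in deps[task]), default=0.0)
        # Pick the pool with the earliest finish time for this task.
        best = min(
            range(n_pools),
            key=lambda p: max(pool_free[p], earliest) + durations[task][p],
        )
        start = max(pool_free[best], earliest)
        finish[task] = start + durations[task][best]
        pool_free[best] = finish[task]
        remaining.remove(task)
    return finish, max(finish.values())
```

Because the map always commits the top-priority ready task immediately, some feasible orders can never be produced, which is the restricted reachability the abstract attributes the optimality gap to.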

Title: Permutation-Symmetrized Diffusion for Unconditional Molecular Generation

Authors: Gyeonghoon Ko, Juho Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.23255
Pdf URL: https://arxiv.org/pdf/2603.23255
Copy Paste: [[2603.23255]] Permutation-Symmetrized Diffusion for Unconditional Molecular Generation(https://arxiv.org/abs/2603.23255)
Keywords: generation
Abstract: Permutation invariance is fundamental in molecular point-cloud generation, yet most diffusion models enforce it indirectly via permutation-equivariant networks on an ordered space. We propose to model diffusion directly on the quotient manifold $\tilde{\mathcal{X}}=\mathbb{R}^{d\times N}/S_N$, where all atom permutations are identified. We show that the heat kernel on $\tilde{\mathcal{X}}$ admits an explicit expression as a sum of Euclidean heat kernels over permutations, which clarifies how diffusion on the quotient differs from ordered-particle diffusion. Training requires a permutation-symmetrized score involving an intractable sum over $S_N$; we derive an expectation form over a posterior on permutations and approximate it using MCMC in permutation space. We evaluate on unconditional 3D molecule generation on QM9 under the EQGAT-Diff protocol, using a SemlaFlow-style backbone and treating all variables continuously. The results demonstrate that quotient-based permutation symmetrization is practical and yields competitive generation quality with improved efficiency.
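
The "sum of Euclidean heat kernels over permutations" and the "expectation form over a posterior on permutations" can be written out as follows. This is a sketch under stated assumptions: the heat equation is taken as $\partial_t u = \Delta u$, $\sigma \cdot y$ denotes relabeling the $N$ atoms (columns) of $y$, and the paper's own normalization and time scaling may differ.

```latex
\begin{align*}
% Euclidean heat kernel on R^{d x N}
p_t(x, y) &= (4\pi t)^{-dN/2} \exp\!\left(-\frac{\lVert x - y \rVert^2}{4t}\right) \\
% Kernel on the quotient: sum over all atom relabelings
\tilde{p}_t(x, y) &= \sum_{\sigma \in S_N} p_t(x, \sigma \cdot y) \\
% Its score is a posterior-weighted average of Euclidean scores
\nabla_x \log \tilde{p}_t(x, y) &= \sum_{\sigma \in S_N} w_\sigma(x, y) \left(-\frac{x - \sigma \cdot y}{2t}\right),
\qquad
w_\sigma(x, y) = \frac{p_t(x, \sigma \cdot y)}{\sum_{\tau \in S_N} p_t(x, \tau \cdot y)}
\end{align*}
```

The last line follows from $\nabla \log \sum_\sigma p_\sigma = \sum_\sigma (p_\sigma / \sum_\tau p_\tau)\, \nabla \log p_\sigma$. The weights $w_\sigma$ form a distribution over $N!$ permutations, which is why sampling them (for example with MCMC, as the abstract states) is needed instead of exact summation.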

Title: Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning

Authors: Konstantinos Barmpounakis, Theodoros P. Vagenas, Maria Vakalopoulou, George K. Matsopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23295
Pdf URL: https://arxiv.org/pdf/2603.23295
Copy Paste: [[2603.23295]] Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning(https://arxiv.org/abs/2603.23295)
Keywords: generation
Abstract: Radiotherapy workflows for oncological patients increasingly rely on multi-modal medical imaging, commonly involving both Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). MRI-only treatment planning has emerged as an attractive alternative, as it reduces patient exposure to ionizing radiation and avoids errors introduced by inter-modality registration. While nnU-Net-based frameworks are predominantly used for MRI-to-CT synthesis, we explore Mamba-based architectures for this task, aiming to showcase the advantages of state-space modeling for cross-modality translation compared to standard convolutional neural networks. Specifically, we adapt both the U-Mamba and the SegMamba architectures, originally proposed for segmentation, to perform cross-modality image generation. Our 3D Mamba architecture effectively captures complex volumetric features and long-range dependencies, thus allowing accurate CT synthesis while maintaining fast inference times. Experiments were conducted on a subset of the SynthRAD2025 dataset, comprising registered single-channel MRI-CT volume pairs across three anatomical regions. Quantitative evaluation is performed via a combination of image similarity metrics computed in Hounsfield Units (HU) and segmentation-based metrics obtained from TotalSegmentator to ensure geometric consistency is preserved. The findings pave the way for the integration of state-space models into radiotherapy workflows.

Title: Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression

Authors: V. K. Cody Bumgardner, Mitchell A. Klusty, Mahmut S. Gokmen, Evan W. Damron
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23308
Pdf URL: https://arxiv.org/pdf/2603.23308
Copy Paste: [[2603.23308]] Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression(https://arxiv.org/abs/2603.23308)
Keywords: generation
Abstract: Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum's bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.
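
The reported percentages are relative gains over the baseline macro F1, not absolute differences. A quick check of the arithmetic, using only the three scores quoted in the abstract (the helper names are illustrative):

```python
def macro_f1(per_class_f1):
    """Macro F1: unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


def relative_gain_percent(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# Scores quoted in the abstract: 0.429 and 0.448 against a baseline of 0.414.
print(round(relative_gain_percent(0.429, 0.414), 1))  # 3.6
print(round(relative_gain_percent(0.448, 0.414), 1))  # 8.2
```

In absolute terms the gains are 0.015 and 0.034 macro F1 points.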

Title: Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection

Authors: Rodrigo F. L. Lassance, Jasper De Bock
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.23318
Pdf URL: https://arxiv.org/pdf/2603.23318
Copy Paste: [[2603.23318]] Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection(https://arxiv.org/abs/2603.23318)
Keywords: generative
Abstract: Among the different possible strategies for evaluating the reliability of individual predictions of classifiers, robustness quantification stands out as a method that evaluates how much uncertainty a classifier could cope with before changing its prediction. However, its applicability is more limited than some of its alternatives, since it requires the use of generative models and restricts the analyses either to specific model architectures or discrete features. In this work, we propose a new robustness metric applicable to any probabilistic discriminative classifier and any type of features. We demonstrate that this new metric is capable of distinguishing between reliable and unreliable predictions, and use this observation to develop new strategies for dynamic classifier selection.

Title: ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

Authors: Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23326
Pdf URL: https://arxiv.org/pdf/2603.23326
Copy Paste: [[2603.23326]] ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images(https://arxiv.org/abs/2603.23326)
Keywords: generation
Abstract: Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at this https URL.

Title: An Explainable AI-Driven Framework for Automated Brain Tumor Segmentation Using an Attention-Enhanced U-Net

Authors: MD Rashidul Islam, Bakary Gibba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23344
Pdf URL: https://arxiv.org/pdf/2603.23344
Copy Paste: [[2603.23344]] An Explainable AI-Driven Framework for Automated Brain Tumor Segmentation Using an Attention-Enhanced U-Net(https://arxiv.org/abs/2603.23344)
Keywords: generation
Abstract: Computer-aided segmentation of brain tumors from MRI data is of crucial significance to clinical decision-making in diagnosis, treatment planning, and follow-up disease monitoring. Gliomas, owing to their high malignancy and heterogeneity, represent a very challenging task for accurate and reliable segmentation into intra-tumoral sub-regions. Manual segmentation is typically time-consuming and not reliable, which justifies the need for robust automation. This research resolves this problem by leveraging the BraTS 2020 dataset, where we have labeled MRI scans of glioma patients with four significant classes: background/healthy tissue, necrotic/non-enhancing core, edema, and enhancing tumor. In this work, we present a new segmentation technique based on a U-Net model augmented with executed attention gates to focus on the most significant regions of images. To counter class imbalance, we employ manually designed loss functions like Dice Loss and Categorical Dice Loss, in conjunction with standard categorical cross-entropy. Other evaluation metrics, like sensitivity and specificity, were used to measure discriminability of the model between tumor classes. Besides, we introduce Grad-CAM-based explainable AI to enable visualizing attention regions and improve model interpretability, together with a smooth heatmap generation technique through Gaussian filtering. Our approach achieved superior performance with accuracy of 0.9919, Dice coefficient of 0.9901, mean IoU of 0.9873, sensitivity of 0.9908, and specificity of 0.9974. This study demonstrates that the use of attention mechanisms, personalized loss functions, and explainable AI significantly improves highly complex tumor structure segmentation precision in MRI scans, providing a reliable and explainable method for clinical applications.
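
The metrics named in the abstract have standard definitions on binary masks. The sketch below computes them for one class on flattened 0/1 masks. It uses the textbook formulas, not the paper's code, and the paper's "Categorical Dice Loss" presumably averages a soft version of this over classes, which is an assumption.

```python
def confusion_counts(pred, target):
    """True/false positives and negatives for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def dice_loss(pred, target):
    """Dice loss is one minus the Dice coefficient."""
    return 1.0 - dice_coefficient(pred, target)


def iou(pred, target):
    """Intersection over union; defined as 1.0 when both masks are empty."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    union = tp + fp + fn
    return tp / union if union else 1.0


def sensitivity(pred, target):
    """Fraction of true foreground pixels that were predicted as foreground."""
    tp, _, fn, _ = confusion_counts(pred, target)
    return tp / (tp + fn) if (tp + fn) else 1.0


def specificity(pred, target):
    """Fraction of true background pixels that were predicted as background."""
    _, fp, _, tn = confusion_counts(pred, target)
    return tn / (tn + fp) if (tn + fp) else 1.0
```

For a single mask pair, Dice and IoU are tied by Dice = 2·IoU / (1 + IoU). Averages taken over many classes or cases need not satisfy this identity exactly.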

Title: ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Authors: Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.23376
Pdf URL: https://arxiv.org/pdf/2603.23376
Copy Paste: [[2603.23376]] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment(https://arxiv.org/abs/2603.23376)
Keywords: generation
Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
摘要：基于视频的世界模型为具体模拟和规划提供了强大的范例，但由于对通用视觉数据和忽略物理定律的基于可能性的目标进行训练，最先进的模型经常会产生物理上难以置信的操作，例如物体穿透和反重力运动。我们推出了 ABot-PhysWorld，这是一种 14B 扩散变压器模型，可生成视觉逼真、物理合理且动作可控的视频。它建立在包含 300 万个带有物理感知注释的操作剪辑的精选数据集的基础上，使用一种新颖的基于 DPO 的后训练框架和解耦鉴别器来抑制非物理行为，同时保持视觉质量。并行上下文块可以实现精确的空间动作注入，以实现跨实施例控制。为了更好地评估泛化性，我们引入了 EZSbench，这是第一个独立于训练的零样本基准，结合了真实和合成的看不见的机器人任务场景组合。它采用解耦协议来单独评估物理真实性和动作对齐。 ABot-PhysWorld 在 PBench 和 EZSbench 上实现了最先进的性能，在物理合理性和轨迹一致性方面超越了 Veo 3.1 和 Sora v2 Pro。我们将发布EZSbench以促进具体视频生成的标准化评估。
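
The abstract describes the post-training stage only as "DPO-based" with decoupled discriminators and gives no formula. As a rough illustration, the sketch below computes the standard Direct Preference Optimization loss for a single preference pair. The function name, argument names, and the default `beta` are hypothetical, and for a diffusion model the log-probabilities would typically be approximated rather than computed exactly, so this is not the authors' implementation.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All arguments are log-probabilities of a whole sample under the
    trainable policy or the frozen reference model. beta is an assumed
    temperature, not a value taken from the paper.
    """
    # How much more the policy prefers the winner than the reference does.
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this reading, the "preferred" sample would be a physically plausible clip and the "rejected" one a clip showing penetration or anti-gravity motion; the loss falls as the policy raises the preferred clip's likelihood relative to the reference model.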

Title: SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Authors: Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2603.23386
Pdf URL: https://arxiv.org/pdf/2603.23386
Copy Paste: [[2603.23386]] SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM(https://arxiv.org/abs/2603.23386)
Keywords: generation
Abstract: High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
摘要：高质量的铰接式 3D 资产对于具体 AI 和物理模拟来说是不可或缺的，但 3D 生成仍然侧重于静态网格，在“模拟就绪”交互式对象方面留下了空白。最近的铰接式对象创建方法依赖于多级管道，这些管道会在解耦模块之间累积错误。或者，统一的 MLLM 提供单阶段路径来联合静态资产理解和模拟就绪资产生成。然而，基于密集体素的 3D 标记化会产生长 3D 标记序列和高内存开销，从而限制了复杂铰接对象的可扩展性。为了解决这个问题，我们提出了 SIMART，这是一个统一的 MLLM 框架，可以联合执行零件级分解和运动学预测。通过引入稀疏 3D VQ-VAE，与密集体素标记相比，SIMART 将标记数量减少了 70%，从而实现了高保真多部件装配。 SIMART 在 PartNet-Mobility 和野外 AIGC 数据集上实现了最先进的性能，并实现了基于物理的机器人仿真。

Title: Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

Authors: Michal Balcerak, Suprosana Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.23398
Pdf URL: https://arxiv.org/pdf/2603.23398
Copy Paste: [[2603.23398]] Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation(https://arxiv.org/abs/2603.23398)
Keywords: generation, generative
Abstract: Energy-based models for discrete domains, such as graphs, explicitly capture relative likelihoods, naturally enabling composable probabilistic inference tasks like conditional generation or enforcing constraints at test-time. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities. This has historically resulted in a fidelity gap relative to discrete diffusion models. We introduce Graph Energy Matching (GEM), a generative framework for graphs that closes this fidelity gap. Motivated by the transport map optimization perspective of the Jordan-Kinderlehrer-Otto (JKO) scheme, GEM learns a permutation-invariant potential energy that simultaneously provides transport-aligned guidance from noise toward data and refines samples within regions of high data likelihood. Further, we introduce a sampling protocol that leverages an energy-based switch to seamlessly bridge (i) rapid, gradient-guided transport toward high-probability regions and (ii) a mixing regime for exploration of the learned graph distribution. On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines. Beyond sample quality, explicit modeling of relative likelihood enables targeted exploration at inference time, facilitating compositional generation, property-constrained sampling, and geodesic interpolation between graphs.
摘要：用于离散域（例如图）的基于能量的模型显式捕获相对可能性，自然地实现可组合的概率推理任务，例如条件生成或在测试时强制执行约束。然而，基于离散能量的模型通常难以实现高效和高质量的采样，因为脱离支持的区域通常包含虚假的局部最小值，从而捕获采样器并导致训练不稳定。这在历史上导致了相对于离散扩散模型的保真度差距。我们引入了图能量匹配（GEM），这是一种可以缩小保真度差距的图生成框架。受 Jordan-Kinderlehrer-Otto (JKO) 方案的传输图优化视角的启发，GEM 学习排列不变势能，同时提供从噪声到数据的传输对齐指导，并在高数据可能性区域内细化样本。此外，我们引入了一种采样协议，该协议利用基于能量的开关来无缝桥接：（i）向高概率区域的快速梯度引导传输到（ii）用于探索学习的图分布的混合机制。在分子图基准上，GEM 匹配或超过强离散扩散基线。除了样本质量之外，相对可能性的显式建模还可以在推理时进行有针对性的探索，从而促进组合生成、属性约束采样以及图之间的测地线插值。

Title: GeoSANE: Learning Geospatial Representations from Models, Not Data

Authors: Joelle Hanna, Damian Falk, Stella X. Yu, Damian Borth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23408
Pdf URL: https://arxiv.org/pdf/2603.23408
Copy Paste: [[2603.23408]] GeoSANE: Learning Geospatial Representations from Models, Not Data(https://arxiv.org/abs/2603.23408)
Keywords: generation
Abstract: Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models and can generate novel neural network weights on demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at this https URL.
摘要：遥感技术的最新进展导致可用基础模型数量的增加；每个模型都接受不同模式、数据集和目标的训练，但仅捕获庞大地理空间知识领域的一部分。虽然这些模型在各自的领域内显示出强劲的成果，但它们的功能仍然是互补的，而不是统一的。因此，我们的目标不是选择一种模型而不是另一种模型，而是将它们的优势结合到一个单一的共享表示中。我们介绍 GeoSANE，这是一个地理空间模型铸造厂，它从现有基础模型和特定任务模型的权重中学习统一的神经表示，能够按需生成新颖的神经网络权重。给定目标架构，GeoSANE 会生成权重，以便跨多种模式对分类、分割和检测任务进行微调。 GeoSANE 生成的模型始终优于从头开始训练的模型，匹配或超越最先进的遥感基础模型，并且在生成轻量级网络时优于通过剪枝或知识蒸馏获得的模型。对十个不同数据集和 GEO-Bench 的评估证实了其强大的泛化能力。通过从预训练转向权重生成，GeoSANE 引入了一个新框架，用于跨模型和任务统一和传输地理空间知识。代码可在 \href{此 https URL}{此 http URL} 中找到。

Title: I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation

Authors: Jia Li, Han Yan, Yihang Chen, Siqi Li, Xibin Song, Yifu Wang, Jianfei Cai, Tien-Tsin Wong, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23413
Pdf URL: https://arxiv.org/pdf/2603.23413
Copy Paste: [[2603.23413]] I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation(https://arxiv.org/abs/2603.23413)
Keywords: generation
Abstract: Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.
摘要：尽管视频生成取得了显着进展，但在重新访问先前探索的区域时保持长期场景一致性仍然具有挑战性。现有的解决方案要么依赖于显式构建 3D 几何结构，但会受到误差累积和尺度模糊的影响；要么依赖于简单的相机视场 (FoV) 检索，而这在复杂遮挡下通常会失败。为了克服这些限制，我们提出了 I3DM，这是一种新颖的隐式 3D 感知内存机制，用于绕过显式 3D 重建来生成一致的视频场景。我们方法的核心是 3D 感知记忆检索策略，它利用预先训练的前馈新颖视图合成 (FF-NVS) 模型的中间特征来对视图相关性进行评分，即使在高度遮挡的场景中也能实现稳健的检索。此外，为了充分利用检索到的历史帧，我们引入了 3D 对齐的内存注入模块。该模块隐式地将历史内容扭曲到目标视图，并自适应地调节可靠扭曲区域的生成，从而提高重访一致性和精确的摄像机控制。大量的实验表明，我们的方法优于最先进的方法，实现了卓越的重访一致性、生成保真度和相机控制精度。

Title: SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Authors: Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23414
Pdf URL: https://arxiv.org/pdf/2603.23414
Copy Paste: [[2603.23414]] SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling(https://arxiv.org/abs/2603.23414)
Keywords: generation
Abstract: Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples by output length, prioritizing short samples and grouping them for early updates. This simultaneously enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logic puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% better performance than the baseline given the same amount of data.
摘要：扩展强化学习 (RL) 在增强大型语言模型 (LLM) 的推理能力方面显示出强大的前景，特别是在需要长思维链生成的任务中。然而，RL 训练效率通常受到推出阶段的瓶颈，由于自回归生成缓慢以及推出和策略更新之间的同步开销，在生成长轨迹（例如 16k 个令牌）时，推出阶段可能占总训练时间的 70%。我们提出了 SortedRL，这是一种在线长度感知调度策略，旨在通过提高推出效率和保持训练稳定性来解决这一瓶颈。 SortedRL 根据输出长度重新排序推出样本，优先考虑形成组的短样本以进行早期更新。这使得大批量推出、灵活更新批量以及近乎政策性的微课程建设同时得以实现。为了进一步加速管道，SortedRL 采用了一种机制，通过基于缓存的机制来控制离策略训练的程度，并得到专用 RL 基础设施的支持，该基础设施通过有状态控制器和 rollout 缓冲区来管理推出和更新。使用 LLaMA-3.1-8B 和 Qwen-2.5-32B 执行各种任务（包括逻辑谜题以及 AIME 24、Math 500 和 Minerval 等数学挑战）的实验表明，SortedRL 将 RL 训练气泡率降低了 50% 以上，同时在相同数据量的情况下，与基线相比，性能提高了 3.9% 至 18.4%。
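
The core scheduling idea in the abstract, ordering rollouts by output length so that short samples form update groups first, can be sketched in a few lines. This is a deliberately simplified, offline version: the real system is described as online, with a stateful controller, a rollout buffer, and a cache-based off-policy control, none of which is modeled here. The function name and the `tokens` field are hypothetical.

```python
def length_sorted_batches(rollouts, batch_size):
    """Group finished rollouts into update batches, shortest first.

    rollouts: list of dicts, each with a "tokens" list (assumed layout).
    Returns a list of batches; earlier batches hold shorter samples, so
    they can be sent to the policy update without waiting for the
    longest generations to finish.
    """
    ordered = sorted(rollouts, key=lambda r: len(r["tokens"]))
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


rollouts = [
    {"id": 0, "tokens": [0] * 5},
    {"id": 1, "tokens": [0] * 1},
    {"id": 2, "tokens": [0] * 3},
    {"id": 3, "tokens": [0] * 2},
]
batches = length_sorted_batches(rollouts, batch_size=2)
batch_ids = [[r["id"] for r in batch] for batch in batches]
```

Because samples in a batch have similar lengths, less time is spent padding or idling on stragglers, which is one plausible reading of the "bubble ratio" reduction the abstract reports.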

Title: RealMaster: Lifting Rendered Scenes into Photorealistic Video

Authors: Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23462
Pdf URL: https://arxiv.org/pdf/2603.23462
Copy Paste: [[2603.23462]] RealMaster: Lifting Rendered Scenes into Photorealistic Video(https://arxiv.org/abs/2603.23462)
Keywords: generation
Abstract: State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
摘要：最先进的视频生成模型可产生卓越的照片级真实感，但它们缺乏使生成的内容与特定场景要求保持一致所需的精确控制。此外，如果没有底层的显式几何结构，这些模型就无法保证 3D 一致性。相反，3D 引擎提供对每个场景元素的精细控制，并通过设计提供原生 3D 一致性，但它们的输出常常陷入“恐怖谷”。弥合这种模拟与真实的差距需要结构精度（输出必须准确保留输入的几何形状和动态）和全局语义转换（其中材质、照明和纹理必须整体转换以实现照片级真实感）。我们提出了 RealMaster，这是一种利用视频扩散模型将渲染视频提升为逼真视频的方法，同时保持与 3D 引擎的输出完全对齐。为了训练该模型，我们通过基于锚点的传播策略生成配对数据集，其中第一帧和最后一帧经过增强以实现真实感，并使用几何条件提示在中间帧之间传播。然后，我们在这些配对视频上训练 IC-LoRA，将管道的高质量输出提炼成一个模型，该模型可以超越管道的限制，处理出现在序列中间的对象和角色，并无需锚帧即可进行推理。在复杂的 GTA-V 序列上进行评估后，RealMaster 显着优于现有的视频编辑基线，提高了照片真实感，同时保留了原始 3D 控件指定的几何形状、动态和特性。

Title: InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Authors: Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen, Khoi Nguyen, Cuong Pham, Anh Tran
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23463
Pdf URL: https://arxiv.org/pdf/2603.23463
Copy Paste: [[2603.23463]] InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting(https://arxiv.org/abs/2603.23463)
Keywords: generation
Abstract: Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this to random Gaussian noise initialization, which, with a low number of function evaluations (NFEs), causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
摘要：最近的基于扩散的模型在图像修复中实现了照片级真实感，但需要许多采样步骤，限制了实际使用。少步文本到图像模型提供了更快的生成速度，但天真地将它们应用于修复会导致背景和修复区域之间的协调性差和伪影。我们将这个原因追溯到随机高斯噪声初始化，在低函数评估下会导致语义错位和保真度降低。为了克服这个问题，我们提出了 InverFill，这是一种专为修复而设计的一步反转方法，它将输入掩模图像中的语义信息注入到初始噪声中，从而实现高保真度的几步修复。 InverFill 不是训练修复模型，而是在混合采样管道中利用几个步骤的文本到图像模型，并以语义对齐的噪声作为输入，显着改进了普通混合采样，甚至在低 NFE 下匹配专门的修复模型。此外，InverFill 不需要真实图像监督，只增加最小的推理开销。大量实验表明，InverFill 始终如一地增强基线少步模型，提高图像质量和文本连贯性，而无需昂贵的再训练或大量迭代优化。
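
The abstract refers to a "blended sampling pipeline" without spelling it out. In the common form of blended sampling, each denoising step composites the model's output inside the mask with the known background outside it. The sketch below shows only that compositing step on flat lists of pixel values; the function name is hypothetical, and InverFill's actual contribution (the one-step inversion that produces the initial noise) is not modeled.

```python
def blend(generated, known, mask):
    """Composite one sampling step for inpainting.

    generated: values proposed by the model at this step.
    known:     background values (at a matching noise level).
    mask:      1 where the region must be inpainted, 0 where the
               background must be preserved.
    """
    return [m * g + (1 - m) * k for g, k, m in zip(generated, known, mask)]


step_output = blend([1.0, 2.0, 3.0], [9.0, 9.0, 9.0], [1, 0, 1])
```

With only a few sampling steps there are few chances for this compositing to harmonize the two regions, which is consistent with the abstract's argument that the initial noise matters more in the few-step regime.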

Title: One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Authors: Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23488
Pdf URL: https://arxiv.org/pdf/2603.23488
Copy Paste: [[2603.23488]] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation(https://arxiv.org/abs/2603.23488)
Keywords: generation
Abstract: Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at this https URL.
摘要：单目新颖视图合成长期以来需要多视图图像对进行监督，限制了训练数据的规模和多样性。我们认为这是没有必要的：一种观点就足够了。我们展示 OVIE，完全基于不配对的互联网图像进行训练。我们在训练时利用单目深度估计器作为几何支架：我们将源图像提升为 3D，应用采样相机变换，并投影以获得伪目标视图。为了处理遮挡问题，我们引入了一种掩模训练公式，将几何、感知和纹理损失限制在有效区域，从而能够对 3000 万张未经整理的图像进行训练。据推断，OVIE 是不受几何影响的，不需要深度估计器或 3D 表示。 OVIE 专门针对野外图像进行训练，在零样本设置中优于先前的方法，同时比第二最佳基线快 600 倍。代码和模型可通过此 https URL 公开获得。
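
The training-time scaffold described above (lift a source pixel into 3D using estimated depth, apply a sampled camera transformation, project back) corresponds to a standard pinhole-camera reprojection. The sketch below handles one pixel in pure Python. The intrinsics `fx, fy, cx, cy`, the function name, and the row-major rotation layout are illustrative assumptions rather than details from the paper.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Map a source pixel to its location in a pseudo-target view.

    rotation:    3x3 list of lists (source-to-target camera rotation).
    translation: length-3 list (source-to-target camera translation).
    Returns (u', v') or None if the point falls behind the target camera.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the sampled rigid camera transformation.
    moved = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
             for i in range(3)]
    if moved[2] <= 0:
        return None  # not visible; such pixels would be masked out
    # Project back onto the target image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same = reproject_pixel(100.0, 50.0, 2.0, 500.0, 500.0, 64.0, 64.0,
                       IDENTITY, [0.0, 0.0, 0.0])
shifted = reproject_pixel(100.0, 50.0, 2.0, 500.0, 500.0, 64.0, 64.0,
                          IDENTITY, [0.1, 0.0, 0.0])
```

Target pixels that receive no source pixel under this mapping are the disocclusions; the masked training formulation in the abstract restricts the losses to the regions that are covered.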

Title: Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Authors: Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23491
Pdf URL: https://arxiv.org/pdf/2603.23491
Copy Paste: [[2603.23491]] Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation(https://arxiv.org/abs/2603.23491)
Keywords: generation
Abstract: Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
摘要：扩散和流匹配模型解锁了前所未有的创意内容创建功能，例如交互式图像和流视频生成。然而，对更高分辨率、帧速率和上下文长度的需求不断增长，使得高效生成变得越来越具有挑战性，因为计算复杂性随着生成的令牌数量呈二次方增长。我们的工作旨在优化用户注视位置已知或可以估计的设置中的生成过程的效率，例如通过使用眼动追踪。在这些设置中，我们利用人类视觉的偏心率相关的敏锐度：当用户在其注视位置（中央凹区域）周围的小区域中感知非常高分辨率的视觉信息时，解析细节的能力在视野的外围迅速降低。我们的方法从对中心凹分辨率进行建模的掩模开始，以非均匀地分配令牌，将较高的令牌密度分配给中心凹区域，并将较低的密度分配给外围区域。图像或视频是在混合分辨率令牌设置中生成的，产生的结果在感知上与全分辨率生成没有区别，同时大大减少了令牌数量和生成时间。为此，我们开发了一种原则性机制，用于直接从高分辨率数据构建混合分辨率令牌，允许从现有基础模型对注视点扩散模型进行后训练，同时保持跨分辨率的内容一致性。我们通过广泛的分析和精心设计的用户研究验证了我们的方法，证明了注视点作为高效生成的实用且可扩展的轴的功效。
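
To make the token-allocation argument concrete, the sketch below counts tokens for a toy foveation mask: patches in any block that touches the foveal disc keep one token each, while each peripheral block is merged into a single token. The two-level scheme, the block size, and the function name are assumptions for illustration; the paper's mask and mixed-resolution token construction may differ.

```python
def foveated_token_count(grid_h, grid_w, gaze, fovea_radius, block=2):
    """Count tokens when peripheral block x block patches are merged.

    gaze: (row, col) of the gaze location on the patch grid.
    A block keeps full resolution if any of its patches lies within
    fovea_radius of the gaze; otherwise it contributes one token.
    """
    gaze_y, gaze_x = gaze
    tokens = 0
    for top in range(0, grid_h, block):
        for left in range(0, grid_w, block):
            cells = [(y, x)
                     for y in range(top, min(top + block, grid_h))
                     for x in range(left, min(left + block, grid_w))]
            in_fovea = any((y - gaze_y) ** 2 + (x - gaze_x) ** 2
                           <= fovea_radius ** 2 for y, x in cells)
            tokens += len(cells) if in_fovea else 1
    return tokens


full = 8 * 8
foveated = foveated_token_count(8, 8, gaze=(0, 0), fovea_radius=0)
# Self-attention cost grows quadratically with the token count.
attention_cost_ratio = (foveated / full) ** 2
```

Since attention cost scales with the square of the token count, even a moderate reduction in tokens gives a much larger reduction in compute, which is the efficiency lever the abstract points to.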

Title: WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23497
Pdf URL: https://arxiv.org/pdf/2603.23497
Copy Paste: [[2603.23497]] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG(https://arxiv.org/abs/2603.23497)
Keywords: generation, generative
Abstract: Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn these action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is this https URL.
摘要：动力系统理论和强化学习将世界演化视为由行为驱动的潜在状态动态，视觉观察提供了有关状态的部分信息。最近的视频世界模型试图从数据中学习这种以动作为条件的动态。然而，现有的数据集很少满足要求：它们通常缺乏多样化且语义上有意义的动作空间，并且动作直接与视觉观察相关，而不是由底层状态介导。因此，动作常常与像素级的变化纠缠在一起，使得模型很难学习结构化的世界动态并在长期范围内保持一致的进化。在本文中，我们提出了 WildWorld，这是一个具有明确状态注释的大规模动作条件世界建模数据集，是从逼真的 AAA 动作角色扮演游戏（《怪物猎人：荒野》）中自动收集的。 WildWorld 包含超过 1.08 亿帧，具有超过 450 个动作，包括移动、攻击和技能施放，以及角色骨架、世界状态、相机姿势和深度图的同步每帧注释。我们进一步推导出 WildBench，通过动作跟踪和状态对齐来评估模型。大量的实验揭示了对语义丰富的动作进行建模和保持长期状态一致性方面持续存在的挑战，这凸显了对状态感知视频生成的需求。项目页面就是这个 https URL。

Title: DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Authors: Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23499
Pdf URL: https://arxiv.org/pdf/2603.23499
Copy Paste: [[2603.23499]] DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models(https://arxiv.org/abs/2603.23499)
Keywords: restoration
Abstract: Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.
摘要：当面对现实世界的损坏（例如模糊、噪声和压缩伪影）时，基于高质量数据训练的光流模型通常会严重退化。为了克服这一限制，我们制定了退化感知光流，这是一项新任务，旨在从现实世界的损坏视频中进行准确的密集对应估计。我们的主要见解是，图像恢复扩散模型的中间表示本质上是腐败感知的，但缺乏时间感知。为了解决这个限制，我们提升模型通过完整的时空注意力来参与相邻帧，并凭经验证明所得到的特征表现出零样本对应能力。基于这一发现，我们提出了 DA-Flow，这是一种混合架构，它将这些扩散特征与迭代细化框架内的卷积特征融合在一起。在多个基准测试严重退化的情况下，DA-Flow 的性能大大优于现有的光流方法。

Title: UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.23500
Pdf URL: https://arxiv.org/pdf/2603.23500
Copy Paste: [[2603.23500]] UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation(https://arxiv.org/abs/2603.23500)
Keywords: generation
Abstract: Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
摘要：能够交错生成的统一模型已成为一种有前景的范例，随着社区越来越多地关注用于文本的自回归建模和用于图像生成的流匹配。为了推进这个方向，我们提出了一个为交错生成量身定制的统一强化学习框架。我们在其基本单元上验证了我们的方法：单轮推理驱动的图像生成，其中模型首先通过推理扩展用户提示，然后进行图像合成。将这种多模态生成过程表述为具有稀疏终端奖励的马尔可夫决策过程，我们引入 UniGRPO 来使用 GRPO 联合优化文本和图像生成策略。我们采用极简主义方法来避免过度设计，通过无缝集成用于推理的标准 GRPO 和用于视觉合成的 FlowGRPO，利用两种模式的既定培训方案。为了确保多轮交错生成的可扩展性，我们对原始 FlowGRPO 进行了两项关键修改：（1）消除无分类器指导以维持线性、无分支的推出，这对于扩展到涉及多轮交互和多条件生成（例如编辑）的复杂场景至关重要； (2) 将标准潜在 KL 惩罚替换为直接在速度场上的 MSE 惩罚，提供更稳健和直接的正则化信号，以有效减轻奖励黑客行为。我们的实验表明，这种统一的训练方法通过推理显着提高了图像生成质量，为未来完全交错模型的后期训练提供了稳健且可扩展的基线。
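
Two ingredients named in the abstract can be illustrated compactly: the group-relative advantage used by GRPO, and an MSE penalty on velocity fields used in place of a latent KL term. The sketch below assumes the usual GRPO normalization (reward minus group mean, divided by group standard deviation) and treats velocities as flat lists; the function names and `eps` are hypothetical, and how UniGRPO weights or schedules the penalty is not stated in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one group of rollouts.

    Each rollout sampled for the same prompt gets an advantage equal to
    its reward minus the group mean, scaled by the group standard
    deviation. eps guards against division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def velocity_mse_penalty(policy_velocity, reference_velocity):
    """Mean squared error between policy and reference velocity fields."""
    diffs = [(p - r) ** 2 for p, r in zip(policy_velocity, reference_velocity)]
    return sum(diffs) / len(diffs)


advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
penalty = velocity_mse_penalty([1.0, 2.0], [1.0, 0.0])
```

With a sparse terminal reward, the same advantage would be shared by the reasoning tokens and the image-generation steps of a rollout, which is one way to read "jointly optimize text and image generation policies".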

Title: MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Authors: Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.23501
Pdf URL: https://arxiv.org/pdf/2603.23501
Copy Paste: [[2603.23501]] MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage(https://arxiv.org/abs/2603.23501)
Keywords: generation
Abstract: Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
摘要：视觉语言模型 (VLM) 越来越多地用于医疗报告生成和视觉问答等任务。然而，流畅的诊断文本并不能保证安全的视觉理解。在临床实践中，解释从诊断前的健全性检查开始：验证输入的读取是否有效（正确的形态和解剖结构、合理的观点和方向，以及没有明显的完整性违规）。现有的基准在很大程度上假设这一步已经解决，因此错过了一个关键的失败模式：即使输入不一致或无效，模型也可以产生合理的叙述。我们引入了 MedObvious，这是一个包含 1,880 个任务的基准测试，它将输入验证隔离为小型多面板图像集的集级一致性功能：模型必须识别是否有任何面板违反了预期的一致性。 MedObvious 跨越五个渐进层，从基本方向/模态不匹配到临床动机的解剖/观点验证和分类式提示，并包括五种评估格式来测试跨界面的稳健性。通过评估 17 个不同的 VLM，我们发现健全性检查仍然不可靠：一些模型在正常（阴性对照）输入上出现异常，当缩放到更大的图像集时性能会下降，并且测量的准确性在多项选择和开放式设置之间存在很大差异。这些结果表明，医疗 VLM 的预诊断验证仍未解决，在部署前应将其视为一种独特的、安全关键的功能。