2026-03-30

Title: A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

Authors: Changyu Liu, James Chenhao Liang, Wenhao Yang, Yiming Cui, Jinghao Yang, Tianyang Wang, Qifan Wang, Dongfang Liu, Cheng Han
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2603.25758
Pdf URL: https://arxiv.org/pdf/2603.25758
Copy Paste: [[2603.25758]] A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning(https://arxiv.org/abs/2603.25758)
Keywords: generative
Abstract: Diffusion models have significantly reshaped the field of generative artificial intelligence and are now increasingly explored for their capacity in discriminative representation learning. Diffusion Transformer (DiT) has recently gained attention as a promising alternative to conventional U-Net-based diffusion models, demonstrating a promising avenue for downstream discriminative tasks via generative pre-training. However, its current training efficiency and representational capacity remain largely constrained due to the inadequate timestep searching and insufficient exploitation of DiT-specific feature representations. In light of this view, we introduce Automatically Selected Timestep (A-SelecT) that dynamically pinpoints DiT's most information-rich timestep from the selected transformer feature in a single run, eliminating the need for both computationally intensive exhaustive timestep searching and suboptimal discriminative feature selection. Extensive experiments on classification and segmentation benchmarks demonstrate that DiT, empowered by A-SelecT, surpasses all prior diffusion-based attempts efficiently and effectively.

Title: Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification

Authors: Binwei Chen, Huachao Leng, Chi Yeung Mang, Tsz Wai Cheung, Yanhua Chen, Wai Keung Anthony Loh, Chi Ho Wong, Chak Yin Tang
Subjects: cs.CV, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2603.25765
Pdf URL: https://arxiv.org/pdf/2603.25765
Copy Paste: [[2603.25765]] Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification(https://arxiv.org/abs/2603.25765)
Keywords: generative
Abstract: Hard coatings play a critical role in industry, with ceramic materials offering outstanding hardness and thermal stability for applications that demand superior mechanical performance. However, deploying artificial intelligence (AI) for surface roughness classification is often constrained by the need for large labeled datasets and costly high-resolution imaging equipment. In this study, we explore the use of synthetic images, generated with Stable Diffusion XL, as an efficient alternative or supplement to experimentally acquired data for classifying ceramic surface roughness. We show that augmenting authentic datasets with generative images yields test accuracies comparable to those obtained using exclusively experimental images, demonstrating that synthetic images effectively reproduce the structural features necessary for classification. We further assess method robustness by systematically varying key training hyperparameters (epoch count, batch size, and learning rate), and identify configurations that preserve performance while reducing data requirements. Our results indicate that generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering a practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.

Title: Pure and Physics-Guided Deep Learning Solutions for Spatio-Temporal Groundwater Level Prediction at Arbitrary Locations

Authors: Matteo Salis, Gabriele Sartor, Rosa Meo, Stefano Ferraris, Abdourrahmane M. Atto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25779
Pdf URL: https://arxiv.org/pdf/2603.25779
Copy Paste: [[2603.25779]] Pure and Physics-Guided Deep Learning Solutions for Spatio-Temporal Groundwater Level Prediction at Arbitrary Locations(https://arxiv.org/abs/2603.25779)
Keywords: generation
Abstract: Groundwater represents a key element of the water cycle, yet it exhibits intricate and context-dependent relationships that make its modeling a challenging task. Theory-based models have been the cornerstone of scientific understanding. However, their computational demands, simplifying assumptions, and calibration requirements limit their use. In recent years, data-driven models have emerged as powerful alternatives. In particular, deep learning has proven to be a leading approach for its design flexibility and ability to learn complex relationships. We proposed an attention-based pure deep learning model, named STAINet, to predict weekly groundwater levels at an arbitrary and variable number of locations, leveraging both spatially sparse groundwater measurements and spatially dense weather information. Then, to enhance the model's trustworthiness and generalization ability, we considered different physics-guided strategies to inject the groundwater flow equation into the model. Firstly, in the STAINet-IB, by introducing an inductive bias, we also estimated the governing equation components. Then, by adopting a learning bias strategy, we proposed the STAINet-ILB, trained with additional loss terms adding supervision on the estimated equation components. Lastly, we developed the STAINet-ILRB, leveraging the groundwater body recharge zone information estimated by domain experts. The STAINet-ILB performed the best, achieving overwhelming test performances in a rollout setting (median MAPE 0.16%, KGE 0.58). Furthermore, it predicted sensible equation components, providing insights into the model's physical soundness. Physics-guided approaches represent a promising opportunity to enhance both the generalization ability and the trustworthiness, thereby paving the way to a new generation of disruptive hybrid deep learning Earth system models.
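Note on the reported metrics: the abstract quotes a median MAPE and a KGE without defining them. The sketch below shows how these two scores are commonly computed for a single series, assuming the standard mean absolute percentage error and the original Kling-Gupta efficiency formulation (correlation, variability ratio, bias ratio). The function names and the sample values are illustrative assumptions; the paper may use a KGE variant or a different aggregation across locations.

```python
import math


def mape(obs, sim):
    """Mean absolute percentage error, in percent (observations must be nonzero)."""
    return 100.0 * sum(abs((s - o) / o) for o, s in zip(obs, sim)) / len(obs)


def kge(obs, sim):
    """Kling-Gupta efficiency: 1 is a perfect match, lower is worse."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    # Population standard deviations and covariance.
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / n
    r = cov / (std_o * std_s)   # linear correlation
    alpha = std_s / std_o       # variability ratio
    beta = mean_s / mean_o      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


if __name__ == "__main__":
    # Hypothetical groundwater levels (arbitrary units), not data from the paper.
    observed = [100.0, 102.0, 101.0, 103.0]
    simulated = [1.01 * o for o in observed]  # uniform 1% overestimate
    print(f"MAPE = {mape(observed, simulated):.2f}%")
    print(f"KGE  = {kge(observed, simulated):.4f}")
```

A small MAPE alongside a moderate KGE is plausible because MAPE is relative to the absolute level, while KGE also penalizes errors in correlation and variability around the mean.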

Title: MAGNET: Autonomous Expert Model Generation via Decentralized Autoresearch and BitNet Training

Authors: Yongwan Kim, Sungchul Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25813
Pdf URL: https://arxiv.org/pdf/2603.25813
Copy Paste: [[2603.25813]] MAGNET: Autonomous Expert Model Generation via Decentralized Autoresearch and BitNet Training(https://arxiv.org/abs/2603.25813)
Keywords: generation
Abstract: We present MAGNET (Model Autonomously Growing Network), a decentralized system for autonomous generation, training, and serving of domain-expert language models across commodity hardware. MAGNET integrates four components: (1) autoresearch, an autonomous ML research pipeline that automates dataset generation, hyperparameter exploration, evaluation, and error-driven iteration; (2) BitNet b1.58 ternary training, enabling CPU-native inference via this http URL without GPU hardware; (3) DiLoCo-based distributed merging for communication-efficient aggregation of domain specialists; and (4) on-chain contribution tracking on the HOOTi EVM chain. We validate autoresearch through three case studies: video safety classification (balanced accuracy 0.9287 to 0.9851), cryptocurrency directional prediction (41% to 54.9% hit rate), and BitNet hyperparameter optimization (10-phase sweep, -16.7% validation loss).

Title: ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Authors: Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25823
Pdf URL: https://arxiv.org/pdf/2603.25823
Copy Paste: [[2603.25823]] ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?(https://arxiv.org/abs/2603.25823)
Keywords: generation, generative
Abstract: Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at this https URL.

Title: Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception

Authors: Jingpei Lu, Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Omid Mohareri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25867
Pdf URL: https://arxiv.org/pdf/2603.25867
Copy Paste: [[2603.25867]] Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception(https://arxiv.org/abs/2603.25867)
Keywords: generation
Abstract: Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts a smoke-free image and the corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.

Title: Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations

Authors: Suraj Prasad, Pinak Mahapatra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.25870
Pdf URL: https://arxiv.org/pdf/2603.25870
Copy Paste: [[2603.25870]] Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations(https://arxiv.org/abs/2603.25870)
Keywords: generation
Abstract: Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.

Title: DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease

Authors: Runsheng Bai, Chengyu Zhang, Yangdong Deng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.25872
Pdf URL: https://arxiv.org/pdf/2603.25872
Copy Paste: [[2603.25872]] DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease(https://arxiv.org/abs/2603.25872)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip transitions to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of $\tfrac{1}{n}$ or $\tfrac{2}{n+1}$, depending on whether the conservative or aggressive mode is used, where $n$ denotes the number of devices. Empirically, DRiffusion attains 1.4$\times$-3.7$\times$ speedup across multiple diffusion models while incurring minimal degradation in generation quality: on the MS-COCO dataset, both FID and CLIP remain largely on par with those of the original model, while PickScore and HPSv2.1 show only minor average drops of 0.17 and 0.43, respectively. These results verify that DRiffusion delivers substantial acceleration and preserves perceptual quality.
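Note on the theoretical rates: as a quick arithmetic check, the sketch below evaluates the two expressions from the abstract, $\tfrac{1}{n}$ and $\tfrac{2}{n+1}$, for a few device counts. It assumes the "acceleration rate" is the fraction of sequential sampling latency that remains, so the implied speedup is its reciprocal; that reading is an assumption, as is the helper name `latency_fractions`. The abstract does not state unambiguously which expression belongs to which mode, so the code leaves them unlabeled.

```python
from fractions import Fraction


def latency_fractions(n_devices: int) -> tuple[Fraction, Fraction]:
    """Return the two rates from the abstract, (1/n, 2/(n+1)), as exact fractions."""
    if n_devices < 1:
        raise ValueError("n_devices must be a positive integer")
    return Fraction(1, n_devices), Fraction(2, n_devices + 1)


if __name__ == "__main__":
    for n in (2, 4, 8):
        rate_a, rate_b = latency_fractions(n)
        # Assumed interpretation: speedup is the reciprocal of the latency fraction.
        print(
            f"n={n}: 1/n = {rate_a} (speedup {float(1 / rate_a):.2f}x), "
            f"2/(n+1) = {rate_b} (speedup {float(1 / rate_b):.2f}x)"
        )
```

Under this reading, four devices would give ideal speedups of 4x and 2.5x for the two expressions, which brackets part of the 1.4x to 3.7x range reported empirically.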

Title: Automated Quality Assessment of Blind Sweep Obstetric Ultrasound for Improved Diagnosis

Authors: Prasiddha Bhandari, Kanchan Poudel, Nishant Luitel, Bishram Acharya, Angelina Ghimire, Tyler Wellman, Kilian Koepsell, Pradeep Raj Regmi, Bishesh Khanal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25886
Pdf URL: https://arxiv.org/pdf/2603.25886
Copy Paste: [[2603.25886]] Automated Quality Assessment of Blind Sweep Obstetric Ultrasound for Improved Diagnosis(https://arxiv.org/abs/2603.25886)
Keywords: quality assessment
Abstract: Blind Sweep Obstetric Ultrasound (BSOU) enables scalable fetal imaging in low-resource settings by allowing minimally trained operators to acquire standardized sweep videos for automated Artificial Intelligence (AI) interpretation. However, the reliability of such AI systems depends critically on the quality of the acquired sweeps, and little is known about how deviations from the intended protocol affect downstream predictions. In this work, we present a systematic evaluation of BSOU quality and its impact on three key AI tasks: sweep-tag classification, fetal presentation classification, and placenta-location classification. We simulate plausible acquisition deviations, including reversed sweep direction, probe inversion, and incomplete sweeps, to quantify model robustness, and we develop automated quality-assessment models capable of detecting these perturbations. To approximate real-world deployment, we simulate a feedback loop in which flagged sweeps are re-acquired, showing that such correction improves downstream task performance. Our findings highlight the sensitivity of BSOU-based AI models to acquisition variability and demonstrate that automated quality assessment can play a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.

Title: World Reasoning Arena

Authors: PAN Team Institute of Foundation Models: Qiyue Gao, Kun Zhou, Jiannan Xiang, Zihan Liu, Dequan Yang, Junrong Chen, Arif Ahmad, Cong Zeng, Ganesh Bannur, Xinqi Huang, Zheqi Liu, Yi Gu, Yichi Yang, Guangyi Liu, Zhiting Hu, Zhengzhong Liu, Eric Xing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25887
Pdf URL: https://arxiv.org/pdf/2603.25887
Copy Paste: [[2603.25887]] World Reasoning Arena(https://arxiv.org/abs/2603.25887)
Keywords: generation
Abstract: World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at this https URL.

Title: DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

Authors: Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25931
Pdf URL: https://arxiv.org/pdf/2603.25931
Copy Paste: [[2603.25931]] DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation(https://arxiv.org/abs/2603.25931)
Keywords: generation
Abstract: Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.

Title: JRM: Joint Reconstruction Model for Multiple Objects without Alignment

Authors: Qirui Wu, Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Richard Newcombe, Angel X. Chang, Jakob Engel, Henry Howard-Jenkins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25985
Pdf URL: https://arxiv.org/pdf/2603.25985
Copy Paste: [[2603.25985]] JRM: Joint Reconstruction Model for Multiple Objects without Alignment(https://arxiv.org/abs/2603.25985)
Keywords: generation, generative
Abstract: Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.

Title: Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Authors: Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries, Golnoosh Farnadi
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2603.25994
Pdf URL: https://arxiv.org/pdf/2603.25994
Copy Paste: [[2603.25994]] Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2603.25994)
Keywords: generative
Abstract: Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.

Title: FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

Authors: Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26008
Pdf URL: https://arxiv.org/pdf/2603.26008
Copy Paste: [[2603.26008]] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants(https://arxiv.org/abs/2603.26008)
Keywords: generation
Abstract: While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at this https URL.

Title: Constitutive parameterized deep energy method for solid mechanics problems with random material parameters

Authors: Zhangyong Liang, Huanhuan Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.26030
Pdf URL: https://arxiv.org/pdf/2603.26030
Copy Paste: [[2603.26030]] Constitutive parameterized deep energy method for solid mechanics problems with random material parameters(https://arxiv.org/abs/2603.26030)
Keywords: generation
Abstract: In practical structural design and solid mechanics simulations, material properties inherently exhibit random variations within bounded intervals. However, evaluating mechanical responses under continuous material uncertainty remains a persistent challenge. Traditional numerical approaches, such as the Finite Element Method (FEM), incur prohibitive computational costs as they require repeated mesh discretization and equation solving for every parametric realization. Similarly, data-driven surrogate models depend heavily on massive, high-fidelity datasets, while standard physics-informed frameworks (e.g., the Deep Energy Method) strictly demand complete retraining from scratch whenever material parameters change. To bridge this critical gap, we propose the Constitutive Parameterized Deep Energy Method (CPDEM). In this purely physics-driven framework, the strain energy density functional is reformulated by encoding a latent representation of stochastic constitutive parameters. By embedding material parameters directly into the neural network alongside spatial coordinates, CPDEM transforms conventional spatial collocation points into parameter-aware material points. Trained in an unsupervised manner via expected energy minimization over the parameter domain, the pre-trained model continuously learns the solution manifold. Consequently, it enables zero-shot, real-time inference of displacement fields for unknown material parameters without requiring any dataset generation or model retraining. The proposed method is rigorously validated across diverse benchmarks, including linear elasticity, finite-strain hyperelasticity, and complex highly nonlinear contact mechanics. To the best of our knowledge, CPDEM represents the first purely physics-driven deep learning paradigm capable of simultaneously and efficiently handling continuous multi-parameter variations in solid mechanics.
摘要：在实际结构设计和固体力学模拟中，材料特性本质上表现出有限区间内的随机变化。然而，评估连续材料不确定性下的机械响应仍然是一个持续的挑战。传统的数值方法，例如有限元法 (FEM)，会产生高昂的计算成本，因为它们需要对每个参数实现进行重复的网格离散化和方程求解。同样，数据驱动的替代模型在很大程度上依赖于大量的高保真数据集，而标准的物理知识框架（例如深能方法）严格要求在材料参数发生变化时从头开始完整的重新训练。为了弥补这一关键差距，我们提出了本构参数化深度能量方法（CPDEM）。在这个纯粹物理驱动的框架中，通过编码随机本构参数的潜在表示来重新表述应变能密度泛函。通过将材料参数与空间坐标一起直接嵌入到神经网络中，CPDEM 将传统的空间配置点转换为参数感知的材料点。通过参数域上的预期能量最小化以无监督的方式进行训练，预训练模型不断学习解流形。因此，它可以对未知材料参数的位移场进行零样本实时推断，而无需生成任何数据集或模型重新训练。所提出的方法在不同的基准上经过严格验证，包括线性弹性、有限应变超弹性和复杂的高度非线性接触力学。据我们所知，CPDEM 代表了第一个纯粹物理驱动的深度学习范例，能够同时有效地处理固体力学中的连续多参数变化。
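
As a rough illustration of the "expected energy minimization over the parameter domain" idea in the abstract above, the toy sketch below uses a 1D bar (unit cross-section, fixed at x = 0, end load F) whose Young's modulus E is sampled from a bounded interval. Everything here is an assumption made for illustration: the one-parameter ansatz u(x; E) = theta * x / E stands in for a neural network, and the names `displacement` and `potential_energy` are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
F, L = 2.0, 1.0            # end load and bar length (toy values)
E_lo, E_hi = 1.0, 5.0      # bounded interval for the random modulus

def displacement(x, E, theta):
    # Parameter-aware ansatz: the input is the pair (x, E), not x alone.
    return theta * x / E

def potential_energy(E, theta):
    # Strain energy minus external work for u(x) = theta * x / E.
    strain = theta / E
    return 0.5 * E * strain**2 * L - F * displacement(L, E, theta)

theta = 0.0
for _ in range(500):
    # Monte Carlo estimate of the expected energy gradient over E.
    E = rng.uniform(E_lo, E_hi, size=256)
    grad = np.mean((theta - F) * L / E)   # d/dtheta of potential_energy
    theta -= 0.5 * grad

# "Zero-shot" query at a modulus never used as a special case in training.
u_tip = displacement(L, 3.3, theta)       # analytic answer: F * L / 3.3
```

Because the ansatz takes E as an input, one fitted `theta` serves every modulus in the interval, which is the point the abstract makes about avoiding retraining per parameter value.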

Title: Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

Authors: Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26049
Pdf URL: https://arxiv.org/pdf/2603.26049
Copy Paste: [[2603.26049]] Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays(https://arxiv.org/abs/2603.26049)
Keywords: generation
Abstract: Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at this https URL.
摘要：尽管医学视觉语言预训练最近取得了进展，但现有模型仍然难以捕捉诊断工作流程：射线照片通常被视为与上下文无关的图像，而放射科医生的目光（视觉推理的关键线索）在很大程度上仍然没有被现有方法充分探索。这些限制阻碍了疾病特定模式的建模并削弱了跨模式对齐。为了弥补这一差距，我们引入了 CoGaze，一种用于胸部 X 光检查的上下文和凝视引导的视觉语言预训练框架。我们首先提出了一种上下文注入的视觉编码器，该编码器可以模拟放射科医生如何整合临床上下文（包括患者病史、症状和诊断意图）来指导诊断推理。然后，我们提出了一种多级监督范式，该范式（1）通过混合正对比学习强制模内和模间语义对齐，（2）通过疾病感知的跨模态表示学习注入诊断先验，以及（3）利用放射科医生的注视作为概率先验来引导对诊断显着区域的注意力。大量实验表明，CoGaze 在各种任务中始终优于最先进的方法，在自由文本和结构化报告生成方面实现了高达 +2.0% CheXbertF1 和 +1.2% BLEU2，在零样本分类方面实现了 +23.2% AUROC，在图像文本检索方面实现了 +12.2% Precision@1。代码可从此 https URL 获取。

Title: Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Ming Sun, Chao Zhou, Jihong Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26055
Pdf URL: https://arxiv.org/pdf/2603.26055
Copy Paste: [[2603.26055]] Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline(https://arxiv.org/abs/2603.26055)
Keywords: quality assessment
Abstract: Accurately estimating humans' subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for applications such as streaming and gaming. Yet it has long been overlooked, as prior work has addressed it only within the video quality assessment (VQA) task, as a mere sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, (1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA; (2) we develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs; and (3) we propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.
摘要：准确估计人类对视频流畅度的主观反馈，例如运动一致性和帧连续性，对于流媒体和游戏等各种应用至关重要。然而，它长期以来一直被忽视，因为现有技术集中于在视频质量评估（VQA）任务中解决它，仅仅将其作为整体质量的一个子维度。在这项工作中，我们进行了试点实验，结果表明当前的 VQA 预测在很大程度上低估了流畅性，从而限制了它们的适用性。为此，我们开创了视频流畅性评估（VFA）作为专注于时间维度的独立感知任务。为了推进 VFA 研究，1) 我们构建了一个以流畅性为导向的数据集 FluVid，其中包含 4,606 个具有平衡流畅性分布的野外视频，具有首个 VFA 评分标准和人类研究。 2) 我们开发了一个包含 23 种方法的大规模基准，这是迄今为止 FluVid 上最全面的基准，为 VFA 定制的模型设计收集见解。 3）我们提出了一个名为 FluNet 的基线模型，它部署时间排列自注意力（T-PSA）来丰富输入流畅性信息并增强远程帧间交互。我们的工作不仅实现了最先进的性能，更重要的是，为社区提供了探索 VFA 解决方案的路线图。

Title: R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting

Authors: Tianrui Lou, Siyuan Liang, Jiawei Liang, Yuze Gao, Xiaochun Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26067
Pdf URL: https://arxiv.org/pdf/2603.26067
Copy Paste: [[2603.26067]] R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting(https://arxiv.org/abs/2603.26067)
Keywords: generation
Abstract: Physical adversarial camouflage poses a severe security threat to autonomous driving systems by mapping adversarial textures onto 3D objects. Nevertheless, current methods remain brittle in complex dynamic scenarios, failing to generalize across diverse geometric (e.g., viewing configurations) and radiometric (e.g., dynamic illumination, atmospheric scattering) variations. We attribute this deficiency to two fundamental limitations in simulation and optimization. First, the reliance on coarse, oversimplified simulations (e.g., via CARLA) induces a significant domain gap, confining optimization to a biased feature space. Second, standard strategies targeting average performance result in a rugged loss landscape, leaving the camouflage vulnerable to configuration changes. To bridge these gaps, we propose the Relightable Physical 3D Gaussian Splatting (3DGS) based Attack framework (R-PGA). Technically, to address the simulation fidelity issue, we leverage 3DGS to ensure photo-realistic reconstruction and augment it with physically disentangled attributes to decouple intrinsic material from lighting. Furthermore, we design a hybrid rendering pipeline that leverages precise Relightable 3DGS for foreground rendering, while employing a pre-trained image translation model to synthesize plausible relighted backgrounds that align with the relighted foreground. To address the optimization robustness issue, we propose the Hard Physical Configuration Mining (HPCM) module, designed to actively mine worst-case physical configurations and suppress their corresponding loss peaks. This strategy not only diminishes the overall loss magnitude but also effectively flattens the rugged loss landscape, ensuring consistent adversarial effectiveness and robustness across varying physical configurations.
摘要：物理对抗性伪装通过将对抗性纹理映射到 3D 对象上，对自动驾驶系统构成严重的安全威胁。然而，当前的方法在复杂的动态场景中仍然很脆弱，无法概括不同的几何（例如，观察配置）和辐射（例如，动态照明、大气散射）变化。我们将这种缺陷归因于模拟和优化的两个基本限制。首先，对粗略、过于简化的模拟（例如，通过 CARLA）的依赖会导致显着的域差距，将优化限制在有偏差的特征空间。其次，针对平均性能的标准策略会导致崎岖的损失景观，使伪装容易受到配置变化的影响。为了弥补这些差距，我们提出了基于 Relightable 物理 3D 高斯泼溅 (3DGS) 的攻击框架 (R-PGA)。从技术上讲，为了解决模拟保真度问题，我们利用 3DGS 来确保逼真的重建，并通过物理分离的属性对其进行增强，以将内在材料与照明分离。此外，我们设计了一个混合渲染管道，利用精确的 Relightable 3DGS 进行前景渲染，同时采用预先训练的图像翻译模型来合成与重新点亮的前景一致的合理的重新点亮的背景。为了解决优化鲁棒性问题，我们提出了硬物理配置挖掘 (HPCM) 模块，旨在主动挖掘最坏情况的物理配置并抑制其相应的损失峰值。这种策略不仅减少了总体损失幅度，而且有效地平坦化了崎岖的损失景观，确保在不同的物理配置中保持一致的对抗有效性和鲁棒性。

Title: When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

Authors: Zhihan Chen, Yuhuan Zhao, Yijie Zhu, Xinyu Yao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26078
Pdf URL: https://arxiv.org/pdf/2603.26078
Copy Paste: [[2603.26078]] When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization(https://arxiv.org/abs/2603.26078)
Keywords: generative
Abstract: Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive "Illusion of Scalability" in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2's structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.
摘要：主题驱动的文本到图像的扩散模型在保留单一身份方面取得了显着的成功，但它们组成多个交互主题的能力在很大程度上仍未被探索且极具挑战性。现有的评估协议通常依赖于全局 CLIP 指标，该指标对本地身份崩溃不敏感，并且无法捕获多主体纠缠的严重性。在本文中，我们发现了当前模型中普遍存在的“可扩展性错觉”：虽然它们擅长以简单布局合成 2-4 个主题，但当扩展到 6-10 个主题或承担复杂的物理交互任务时，它们会遭受灾难性的身份崩溃。为了系统地揭示这种失败模式，我们构建了严格的压力测试基准，其中包括分布在不同主题数量和交互难度（中性、遮挡、交互）中的 75 个提示。此外，我们证明基于标准 CLIP 的指标对于这项任务来说存在根本缺陷，因为它们经常为语义正确但身份崩溃的图像分配高分（例如，生成通用克隆）。为了解决这个问题，我们引入了主题崩溃率（SCR），这是一种基于 DINOv2 结构先验的新颖评估指标，它严格惩罚局部注意力泄漏和同质化。我们对最先进模型（MOSAIC、XVerse、PSR）的广泛评估表明，随着场景复杂性的增加，身份保真度急剧下降，10 个受试者的 SCR 接近 100%。我们将这种崩溃追溯到全局注意力路由中固有的语义捷径，强调了未来生成架构中明确的物理解缠的迫切需要。

Title: TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

Authors: Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, Anuj Karpatne, Cheng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26128
Pdf URL: https://arxiv.org/pdf/2603.26128
Copy Paste: [[2603.26128]] TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life(https://arxiv.org/abs/2603.26128)
Keywords: generation
Abstract: Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
摘要：准确地生成生命之树的图像非常困难：地球上有超过 1000 万种不同的物种，其中许多物种的区别仅在于细微的视觉特征。尽管在文本到图像合成方面取得了显着进展，但现有模型通常无法捕获定义物种身份的细粒度视觉线索，即使它们的输出看起来非常逼真。为此，我们提出了 TaxaAdapter，这是一种简单而轻量级的方法，它结合了 BioCLIP 等视觉分类模型（VTM）来指导细粒度物种的生成。我们的方法将 VTM 嵌入注入到冻结的文本到图像扩散模型中，提高物种级别的保真度，同时保留对姿势、风格和背景等属性的灵活文本控制。大量实验表明，TaxaAdapter 凭借更清晰的架构和训练配方，在强基线上持续提高了形态保真度和物种识别准确性。为了更好地评估这些改进，我们还引入了一种基于多模态大型语言模型的度量，该度量总结了生成图像和真实图像的特征级描述，从而提供了更可解释的形态一致性度量。除此之外，我们观察到 TaxaAdapter 表现出强大的泛化能力，能够在具有挑战性的情况下进行物种合成，例如只有少量训练图像的少数物种，甚至在训练期间看不见的物种。总的来说，我们的结果强调了 VTM 是可扩展、细粒度物种生成的关键要素。

Title: InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

Authors: Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26134
Pdf URL: https://arxiv.org/pdf/2603.26134
Copy Paste: [[2603.26134]] InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution(https://arxiv.org/abs/2603.26134)
Keywords: super-resolution, generative
Abstract: Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.
摘要：视频超分辨率 (VSR) 旨在从低分辨率输入重建高分辨率帧。虽然基于扩散的方法大大提高了感知质量，但将其扩展到视频仍然具有挑战性，原因有两个：强生成先验可能会引入时间不稳定，并且多帧扩散管道对于实际部署而言通常过于昂贵。为了同时解决这两个挑战，我们提出了 InstaVSR，一种用于高效视频超分辨率的轻量级扩散框架。 InstaVSR 结合了三个要素：(1) 修剪后的一步扩散主干，从传统的基于扩散的 VSR 管道中删除了几个昂贵的组件，(2) 通过流引导时间正则化进行循环训练，以提高帧到帧的稳定性，(3) 潜在空间和像素空间中的双空间对抗性学习，以在主干简化后保持感知质量。在 NVIDIA RTX 4090 上，InstaVSR 在不到一分钟的时间内以 2K$\times$2K 分辨率处理 30 帧视频，仅使用 7 GB 内存，与现有的基于扩散的方法相比，大大降低了计算成本，同时保持良好的感知质量，时间过渡显着平滑。

Title: IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios

Authors: Xiaofeng Li, Leyi Sheng, Zhen Sun, Zongmin Zhang, Jiaheng Wei, Xinlei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26154
Pdf URL: https://arxiv.org/pdf/2603.26154
Copy Paste: [[2603.26154]] IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios(https://arxiv.org/abs/2603.26154)
Keywords: generation
Abstract: With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment. To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods' robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model and cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.
摘要：随着图像到视频（I2V）生成模型的快速发展，它们被滥用以创建恶意内容的可能性已成为一个重大问题。例如，可以利用单个图像来生成虚假视频，该视频可用于吸引注意力并获得利益。这种现象称为 I2V 生成误用。现有的图像保护方法缺乏统一的基准，导致评估框架不完整。此外，这些方法尚未在 I2V 生成场景中以及针对预处理攻击方面进行系统评估，这使得评估其在实际部署中的有效性变得复杂。为了解决这一挑战，我们提出了 IP-Bench（图像保护基准），这是第一个旨在评估 I2V 生成场景中的保护方法的系统基准。该基准测试检查了 6 种代表性保护方法和 5 种最先进的 I2V 模型。此外，我们的工作系统地评估了实际场景下两种鲁棒性攻击策略的保护方法的鲁棒性，并分析了它们的跨模型和跨模态可迁移性。总体而言，IP-Bench为I2V生成场景中的图像保护方法建立了系统的、可重复的、可扩展的评估框架。

Title: Provably Contractive and High-Quality Denoisers for Convergent Restoration

Authors: Shubhi Shukla, Pravin Nair
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26168
Pdf URL: https://arxiv.org/pdf/2603.26168
Copy Paste: [[2603.26168]] Provably Contractive and High-Quality Denoisers for Convergent Restoration(https://arxiv.org/abs/2603.26168)
Keywords: restoration
Abstract: Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness-accuracy trade-off. We develop provably contractive (global Lipschitz $< 1$) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength $\|\delta\|\le\varepsilon$ induce at most $\varepsilon$ change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models. Codes and pretrained models are available at this https URL.
摘要：图像恢复，即从退化的测量中恢复干净的图像，在监视、国防和医学成像等各个领域都有应用。尽管实现了最先进的（SOTA）恢复性能，但现有的卷积和基于注意力的网络在输入微小变化下缺乏稳定性保证，暴露了鲁棒性和准确性的权衡。我们开发了可证明收缩的（全局 Lipschitz 常数 $< 1$）降噪器网络，可大大缩小这一差距。我们的设计由通过展开技术获得的近端层以及 Lipschitz 控制的卷积细化组成。通过收缩性，我们的降噪器保证强度 $\|\delta\|\le\varepsilon$ 的输入扰动最多在输出处引起 $\varepsilon$ 变化，而 DnCNN 和 Restormer 等强基线在相同扰动下可以表现出更大的偏差。在图像去噪方面，所提出的模型与无约束的 SOTA 去噪器具有竞争力，报告了可证明的 1-Lipschitz 模型的最紧密差距，并确定这种差距确实可以通过收缩去噪器实现。此外，所提出的降噪器充当图像恢复的强大正则器，可证明影响即插即用算法的收敛。我们的结果表明，实施严格的 Lipschitz 控制不会本质上降低输出质量，挑战文献中的常见假设，并使该领域朝着可验证和稳定的视觉模型发展。代码和预训练模型可在此 https URL 获取。
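
The contractivity guarantee quoted in the abstract above (a perturbation of norm at most $\varepsilon$ changes the output by at most $\varepsilon$) can be checked numerically on a toy map. The sketch below is not the authors' architecture: it assumes a two-layer network whose weight matrices are rescaled to spectral norm 0.9, and the names `make_contractive` and `denoise` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_contractive(W, bound=0.9):
    # Rescale W so its spectral norm (largest singular value) equals bound < 1.
    return W * (bound / np.linalg.norm(W, 2))

W1 = make_contractive(rng.normal(size=(64, 64)))
W2 = make_contractive(rng.normal(size=(64, 64)))

def denoise(x):
    # ReLU is 1-Lipschitz, so the composition has Lipschitz
    # constant at most 0.9 * 1 * 0.9 = 0.81.
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.normal(size=64)
delta = 1e-2 * rng.normal(size=64)          # input perturbation
eps = np.linalg.norm(delta)
change = np.linalg.norm(denoise(x + delta) - denoise(x))
# The bound change <= 0.81 * eps holds for every x and delta, not just this draw.
```

The product of per-layer Lipschitz bounds is usually loose, which is one reason strict control is often assumed to cost accuracy; the abstract's claim is that this cost can be made small.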

Title: Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

Authors: Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26179
Pdf URL: https://arxiv.org/pdf/2603.26179
Copy Paste: [[2603.26179]] Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning(https://arxiv.org/abs/2603.26179)
Keywords: generation
Abstract: Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: this https URL.
摘要：开放词汇对象检测的最新进展主要集中在两个方面：扩大数据集和利用对比学习来调整语言和视觉模式。然而，这些方法常常忽视单一模式内的内部一致性，特别是当背景或环境发生变化时。这种一致性的缺乏会导致性能下降，因为模型很难在不同场景中检测到相同的对象，这揭示了鲁棒性差距。为了解决这个问题，我们引入了上下文一致性学习（CCL），这是一个集成了两个关键策略的新颖框架：上下文引导数据生成（CBDG）和上下文一致性损失（CCLoss）。 CBDG 作为一种数据生成机制，生成包含不同背景下的相同对象的图像。这是至关重要的，因为现有数据集本身并不支持我们的 CCL 框架。 CCLoss进一步增强了物体特征在环境变化时的不变性，从而提高了模型在不同场景下的鲁棒性。这些策略共同形成了一个统一的框架，以确保同一模式内的上下文一致性。我们的方法实现了最先进的性能，在 OmniLabel 上超越了之前的方法 +16.3 AP，在 D3 上超越了 +14.9 AP。这些结果证明了加强模式内一致性的重要性，显着增强了不同环境中的模型泛化能力。我们的代码可在以下位置公开获取：此 https URL。

Title: MemCam: Memory-Augmented Camera Control for Consistent Video Generation

Authors: Xinhang Gao, Junlin Guan, Shuhan Luo, Wenzhuo Li, Guanghuan Tan, Jiacheng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26193
Pdf URL: https://arxiv.org/pdf/2603.26193
Copy Paste: [[2603.26193]] MemCam: Memory-Augmented Camera Control for Consistent Video Generation(https://arxiv.org/abs/2603.26193)
Keywords: generation
Abstract: Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.
摘要：交互式视频生成在场景模拟和视频创建方面具有巨大潜力。然而，由于上下文信息有限，现有方法常常难以在动态摄像机控制下的长视频生成过程中保持场景一致性。为了应对这一挑战，我们提出了 MemCam，这是一种内存增强的交互式视频生成方法，它将先前生成的帧视为外部存储器，并将它们用作上下文调节，以实现具有高场景一致性的可控相机视点。为了实现更长、更相关的上下文，我们设计了一个上下文压缩模块，将内存帧编码为紧凑的表示，并采用基于共同可见性的选择来动态检索最相关的历史帧，从而减少计算开销，同时丰富上下文信息。交互式视频生成任务的实验表明，MemCam 在场景一致性方面显着优于现有的基线方法以及开源最先进的方法，特别是在相机旋转较大的长视频场景中。

Title: Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Authors: Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26211
Pdf URL: https://arxiv.org/pdf/2603.26211
Copy Paste: [[2603.26211]] Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding(https://arxiv.org/abs/2603.26211)
Keywords: generation
Abstract: Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.
摘要：自回归 (AR) 视觉语言模型 (VLM) 长期以来一直主导着多模式理解、推理和图形用户界面 (GUI) 基础。最近，离散扩散视觉语言模型（DVLM）在多模态推理、提供双向注意力、并行令牌生成和迭代细化方面表现出了强大的性能。然而，它们在 GUI 基础上的潜力仍未被开发。在这项工作中，我们评估离散 DVLM 是否可以作为 GUI 基础 AR 模型的可行替代方案。我们采用 LLaDA-V 进行单轮动作和边界框预测，将任务框架为从多模态输入生成文本。为了更好地捕获边界框几何的层次结构，我们提出了一种混合掩蔽方案，该方案结合了线性掩蔽和确定性掩蔽，与使用线性掩蔽训练的 GUI 适应的 LLaDA-V 相比，步骤成功率 (SSR) 的接地精度提高了 6.1 个点。对涵盖网络、桌面和移动界面的四个数据集的评估表明，尽管预训练有限，但具有混合掩蔽的自适应扩散模型始终优于线性掩蔽变体，并且与自回归模型相比具有竞争力。系统消融表明，增加扩散步骤、生成长度和块长度可以提高准确性，但也会增加延迟，超过一定数量的扩散步骤后，准确性会趋于稳定。使用不同的 GUI 域扩展训练数据进一步减少了约 1.3 秒的延迟，并将基准测试中的接地精度平均提高了 20 个点。这些结果表明，离散 DVLM 是一种有前途的 GUI 基础建模框架，代表着迈向基于扩散的 GUI 代理的重要一步。

Title: DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation

Authors: Tomoya Miyawaki, Kazuto Nakashima, Yumi Iwashita, Ryo Kurazume
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.26263
Pdf URL: https://arxiv.org/pdf/2603.26263
Copy Paste: [[2603.26263]] DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation(https://arxiv.org/abs/2603.26263)
Keywords: generative
Abstract: LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at this https URL.
摘要：基于激光雷达的语义分割是自主移动机器人的关键组成部分，但激光雷达点云的大规模注释成本高昂且耗时。尽管模拟器可以提供标记的合成数据，但由于数据级域差距，在合成数据上训练的模型通常在现实数据上表现不佳。为了解决这个问题，我们提出了 DRUM，一种新颖的 Sim2Real 翻译框架。我们利用在未标记的真实世界数据上预先训练的扩散模型作为生成先验，并通过再现两个关键测量特征：反射强度和射线滴噪声来转换合成数据。为了提高样本保真度，我们引入了一种射线滴感知掩模引导机制，该机制有选择地强制与输入合成数据的一致性，同时保留由扩散先验引起的真实射线滴噪声。实验结果表明，DRUM 持续改进了 LiDAR 数据的多种表示形式的 Sim2Real 性能。项目页面可通过此 https URL 获取。

Title: PhysVid: Physics Aware Local Conditioning for Generative Video Models

Authors: Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26285
Pdf URL: https://arxiv.org/pdf/2603.26285
Copy Paste: [[2603.26285]] PhysVid: Physics Aware Local Conditioning for Generative Video Models(https://arxiv.org/abs/2603.26285)
Keywords: generation, generative
Abstract: Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by $\approx 33\%$ over baseline video generators, and by up to $\approx 8\%$ on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.
摘要：生成视频模型实现了高视觉保真度，但常常违反基本物理原理，限制了现实世界环境中的可靠性。之前注入物理的尝试依赖于调节：帧级信号是特定于域的且短视野的，而全局文本提示是粗糙且嘈杂的，缺少细粒度的动态。我们提出了 PhysVid，一种物理感知的局部调节方案，可在时间上连续的帧块上运行。每个块都用基于物理的状态、交互和约束描述进行注释，这些描述通过训练期间的块感知交叉注意力与全局提示融合。在推理阶段，我们引入了负面物理提示（对局部相关的物理定律违反情况的描述），以引导生成过程远离不合理的轨迹。在 VideoPhy 上，PhysVid 的物理常识分数比基准视频生成器提高了 $\approx 33\%$，在 VideoPhy2 上最多提高了 $\approx 8\%$。这些结果表明，局部的物理感知指导极大地提高了生成视频的物理合理性，并标志着向基于物理的视频模型迈出了一步。

Title: Label-Free Cross-Task LoRA Merging with Null-Space Compression

Authors: Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26317
Pdf URL: https://arxiv.org/pdf/2603.26317
Copy Paste: [[2603.26317]] Label-Free Cross-Task LoRA Merging with Null-Space Compression(https://arxiv.org/abs/2603.26317)
Keywords: generation
Abstract: Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation models, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA fine-tuning the down-projection factor $A$ in $\Delta W = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
摘要：模型合并结合了独立微调的检查点，无需联合多任务训练。在基础模型时代，低秩适应（LoRA）微调非常普遍，这使得 LoRA 合并成为一个有前景的目标。现有方法可以在所有目标任务都是分类的同质环境中工作，但当任务跨越分类和回归时通常会失败。使用基于熵的代理的方法不适用于回归，并且由于标记序列较长，对于大型语言模型来说成本高昂。我们引入了空空间压缩（NSC）合并，这是一种无标签、与输出无关的方法，可根据适配器几何形状设置合并权重。我们的主要观察结果是，在 LoRA 微调期间，$\Delta W = BA$ 中的下投影因子 $A$ 压缩了其零空间，并且压缩与性能相关。 NSC 将此用作合并的优化信号，可以泛化到分类、回归和序列生成。 NSC 在 20 个异构视觉任务中实现了最先进的性能，并在先前的方法过度拟合任务子集的情况下获得了平衡的增益。它还优于六个 NLI 基准以及 VQA 和图像字幕的视觉语言评估的基线，展示了可扩展性和有效性。
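
For orientation, the textbook linear-algebra facts behind the $\Delta W = BA$ notation can be written out. This is only a sketch of the standard LoRA setting: the shapes $d_{\text{out}}$, $d_{\text{in}}$ and the rank $r$ are notation assumed here, and the abstract does not say how NSC measures the compression of the null space.

```latex
\Delta W = BA, \qquad
B \in \mathbb{R}^{d_{\text{out}} \times r}, \quad
A \in \mathbb{R}^{r \times d_{\text{in}}}, \quad
r \ll \min(d_{\text{out}}, d_{\text{in}})

\mathcal{N}(A) = \{\, x \in \mathbb{R}^{d_{\text{in}}} : Ax = 0 \,\}, \qquad
\dim \mathcal{N}(A) = d_{\text{in}} - \operatorname{rank}(A) \;\ge\; d_{\text{in}} - r

x \in \mathcal{N}(A) \;\Longrightarrow\; \Delta W x = B(Ax) = 0
```

In words: any input direction in the null space of $A$ is left untouched by the adapter, so the geometry of $A$ alone already says which inputs the update can affect.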

Title: Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization

Authors: Zidong Zhao, Yihao Huang, Qing Guo, Tianlin Li, Anran Li, Kailong Wang, Jin Song Dong, Geguang Pu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26328
Pdf URL: https://arxiv.org/pdf/2603.26328
Copy Paste: [[2603.26328]] Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization(https://arxiv.org/abs/2603.26328)
Keywords: generation
Abstract: As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim. Existing methods address this by using verification prompts generated by official model owners, but the generation relies on multiple reference models for optimization, leading to high computational cost and sensitivity to model selection. To address this problem, we propose a reference-free T2I model verification method called Boundary-aware Prompt Optimization (BPO). It directly explores the intrinsic characteristics of the target model. The key insight is that although different T2I models produce similar outputs for normal prompts, their semantic boundaries in the embedding space (transition zones between two concepts such as "corgi" and "bagel") are distinct. Prompts near these boundaries generate unstable outputs (e.g., sometimes a corgi and sometimes a bagel) on the target model but remain stable on other models. By identifying such boundary-adjacent prompts, BPO captures model-specific behaviors that serve as reliable verification cues for distinguishing T2I models. Experiments on five T2I models and four baselines demonstrate that BPO achieves superior verification accuracy.
摘要：随着文本到图像 (T2I) 生成的普及，第三方平台越来越多地集成多个模型 API 以方便图像创建。然而，使用官方模型的虚假声明可能会误导用户并损害模型所有者的声誉，因此模型验证对于确认 API 的底层模型是否与其声明相符至关重要。现有方法通过使用官方模型所有者生成的验证提示来解决此问题，但生成依赖于多个参考模型进行优化，导致计算成本较高且对模型选择敏感。为了解决这个问题，我们提出了一种称为边界感知提示优化（BPO）的无参考 T2I 模型验证方法。它直接探索目标模型的内在特征。关键的见解是，尽管不同的 T2I 模型为正常提示产生相似的输出，但它们在嵌入空间中的语义边界（“corgi”和“bagel”等两个概念之间的过渡区域）是不同的。这些边界附近的提示会在目标模型上生成不稳定的输出（例如，有时是柯基犬，有时是百吉饼），但在其他模型上保持稳定。通过识别此类边界相邻提示，BPO 捕获特定于模型的行为，这些行为可作为区分 T2I 模型的可靠验证线索。对五个 T2I 模型和四个基线的实验表明，BPO 实现了卓越的验证准确性。

Title: Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

Authors: Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26348
Pdf URL: https://arxiv.org/pdf/2603.26348
Copy Paste: [[2603.26348]] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification(https://arxiv.org/abs/2603.26348)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at this https URL.
摘要：多模态大型语言模型（MLLM）实现了强大的多模态推理性能，但我们在长格式生成中发现了一个反复出现的失败模式：随着输出变长，模型逐渐偏离图像证据并依赖于文本先验，导致毫无根据的推理和幻觉。有趣的是，根据注意力分析，我们发现 MLLM 具有进行后期视觉验证的潜在能力，该能力存在但并未持续激活。受这一观察的启发，我们提出了视觉重新检查（VRE），这是一种自我进化的训练框架，使 MLLM 能够在推理过程中自主执行视觉内省，而无需额外的视觉输入。 VRE 不是从更强大的老师那里提炼视觉能力，而是通过利用模型本身生成反射轨迹来促进迭代式自我改进，通过信息增益使视觉信息变得可操作。跨多种多模式基准的大量实验表明，VRE 持续提高推理准确性和感知可靠性，同时大幅减少幻觉，尤其是在长链环境中。代码可从此 https URL 获取。

Title: MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Authors: Quan Dao, Dimitris Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26357
Pdf URL: https://arxiv.org/pdf/2603.26357
Copy Paste: [[2603.26357]] MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model(https://arxiv.org/abs/2603.26357)
Keywords: generative
Abstract: Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at this https URL.
摘要：Transformer 架构，特别是 Diffusion Transformer (DiT)，由于其相比卷积 UNet 更强的性能，已广泛应用于扩散和流匹配模型。然而，DiT 的各向同性设计在每个块中处理相同数量的 patch 化 token，导致训练过程中的计算量相对较大。在这项工作中，我们引入了一种多 patch Transformer 设计，其中早期的块在较大的 patch 上运行以捕获粗略的全局上下文，而后期的块使用较小的 patch 来细化局部细节。这种分层设计可以将以 GFLOPs 计的计算成本降低高达 50%，同时实现良好的生成性能。此外，我们还提出了时间和类别嵌入的改进设计，以加速训练收敛。在 ImageNet 数据集上的大量实验证明了我们架构选择的有效性。代码发布于此 https URL。
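
The cost argument in this abstract can be illustrated with a back-of-the-envelope count. The sketch below is not the authors' architecture: the latent size, width, block counts, and patch sizes are hypothetical, the per-block formula is the usual rough estimate for a standard transformer block, and the cost of switching between patch sizes is ignored.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens for a square latent split into square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    side = latent_size // patch_size
    return side * side


def block_macs(n_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for one standard transformer block.

    12 * N * d^2 covers the QKV/output projections (4 * N * d^2) and an MLP
    with 4x expansion (8 * N * d^2); 2 * N^2 * d covers the attention scores
    and the weighted sum over values.
    """
    return 12 * n_tokens * dim * dim + 2 * n_tokens * n_tokens * dim


def stack_macs(patch_sizes, latent_size: int, dim: int) -> int:
    """Total cost of a stack whose i-th block uses patch_sizes[i]."""
    return sum(block_macs(num_tokens(latent_size, p), dim) for p in patch_sizes)


# Hypothetical configuration, not taken from the paper.
LATENT, DIM = 32, 64
isotropic = [2] * 8                # every block sees the same token count
multi_patch = [4] * 4 + [2] * 4    # early blocks use larger patches

ratio = stack_macs(multi_patch, LATENT, DIM) / stack_macs(isotropic, LATENT, DIM)
print(f"tokens at patch 2: {num_tokens(LATENT, 2)}, at patch 4: {num_tokens(LATENT, 4)}")
print(f"relative cost of the multi-patch stack: {ratio:.2f}")
```

Doubling the patch size cuts the token count by four, which is why moving only some blocks to coarse patches already removes a large share of the compute in this toy count.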

Title: A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models

Authors: Steffen Herbold, Florian Lemmerich
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.26363
Pdf URL: https://arxiv.org/pdf/2603.26363
Copy Paste: [[2603.26363]] A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models(https://arxiv.org/abs/2603.26363)
Keywords: generation
Abstract: The generation of texts using Large Language Models (LLMs) is inherently uncertain, with sources of uncertainty being not only the generation of texts, but also the prompt used and the downstream interpretation. Within this work, we provide a formal framework for the measurement of uncertainty that takes these different aspects into account. Our framework models prompting, generation, and interpretation as interconnected autoregressive processes that can be combined into a single sampling tree. We introduce filters and objective functions to describe how different aspects of uncertainty can be expressed over the sampling tree and demonstrate how to express existing approaches towards uncertainty through these functions. With our framework we show not only how different methods are formally related and can be reduced to a common core, but also point out additional aspects of uncertainty that have not yet been studied.
摘要：使用大型语言模型（LLM）生成文本本质上是不确定的，不确定性的来源不仅在于文本的生成，还在于所使用的提示和下游解释。在这项工作中，我们提供了一个考虑到这些不同方面的不确定性测量的正式框架。我们的框架将提示、生成和解释建模为相互关联的自回归过程，可以组合成单个采样树。我们引入过滤器和目标函数来描述如何在采样树上表达不确定性的不同方面，并演示如何通过这些函数表达现有的不确定性方法。通过我们的框架，我们不仅展示了不同的方法如何在形式上相关并可以简化为共同的核心，而且还指出了尚未研究的不确定性的其他方面。

Title: Generative Modeling in Protein Design: Neural Representations, Conditional Generation, and Evaluation Standards

Authors: Senura Hansaja Wanasekara, Minh-Duong Nguyen, Xiaochen Liu, Nguyen H. Tran, Ken-Tye Yong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26378
Pdf URL: https://arxiv.org/pdf/2603.26378
Copy Paste: [[2603.26378]] Generative Modeling in Protein Design: Neural Representations, Conditional Generation, and Evaluation Standards(https://arxiv.org/abs/2603.26378)
Keywords: generation, generative
Abstract: Generative modeling has become a central paradigm in protein research, extending machine learning beyond structure prediction toward sequence design, backbone generation, inverse folding, and biomolecular interaction modeling. However, the literature remains fragmented across representations, model classes, and task formulations, making it difficult to compare methods or identify appropriate evaluation standards. This survey provides a systematic synthesis of generative AI in protein research, organized around (i) foundational representations spanning sequence, geometric, and multimodal encodings; (ii) generative architectures including $\mathrm{SE}(3)$-equivariant diffusion, flow matching, and hybrid predictor-generator systems; and (iii) task settings from structure prediction and de novo design to protein-ligand and protein-protein interactions. Beyond cataloging methods, we compare assumptions, conditioning mechanisms, and controllability, and we synthesize evaluation best practices that emphasize leakage-aware splits, physical validity checks, and function-oriented benchmarks. We conclude with critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for dual-use biosecurity risks. By unifying architectural advances with practical evaluation standards and responsible development considerations, this survey aims to accelerate the transition from predictive modeling to reliable, function-driven protein engineering.
摘要：生成建模已成为蛋白质研究的中心范式，将机器学习从结构预测扩展到序列设计、主链生成、反向折叠和生物分子相互作用建模。然而，文献在表征、模型类别和任务制定方面仍然支离破碎，因此很难比较方法或确定适当的评估标准。这项调查提供了蛋白质研究中生成人工智能的系统综合，围绕（i）跨越序列、几何和多模态编码的基础表示； (ii) 生成架构，包括 $\mathrm{SE}(3)$-等变扩散、流匹配和混合预测生成器系统； (iii)从结构预测和从头设计到蛋白质-配体和蛋白质-蛋白质相互作用的任务设置。除了编目方法之外，我们还比较假设、调节机制和可控性，并综合评估最佳实践，强调泄漏感知分割、物理有效性检查和面向功能的基准。最后，我们提出了关键的开放挑战：对构象动力学和本质上无序的区域进行建模，在保持效率的同时扩展到大型组件，以及针对双重用途生物安全风险开发强大的安全框架。通过将架构进步与实际评估标准和负责任的开发考虑相结合，这项调查旨在加速从预测建模到可靠、功能驱动的蛋白质工程的转变。

Title: Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration

Authors: I-Hsiang Chen, Isma Hadji, Enrique Sanchez, Adrian Bulat, Sy-Yen Kuo, Radu Timofte, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26385
Pdf URL: https://arxiv.org/pdf/2603.26385
Copy Paste: [[2603.26385]] Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration(https://arxiv.org/abs/2603.26385)
Keywords: restoration, quality assessment
Abstract: Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process, that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.
摘要：图像恢复旨在从因各种因素（例如恶劣天气、模糊或弱光）而劣化的输入中恢复高质量图像。虽然最近的研究表明在单个或统一的恢复任务中取得了显着的进展，但在处理未知或复合退化时，它们仍然存在泛化能力有限和效率低下的问题。为了解决这些限制，我们提出了 RAR，一种恢复、评估和重复过程，它将图像质量评估 (IQA) 和图像恢复 (IR) 集成到一个统一的框架中，以迭代、高效地实现高质量图像恢复。具体来说，我们引入了一种完全在潜在域中运行的恢复过程，以联合执行退化识别、图像恢复和质量验证。生成的模型是完全可端到端训练的，并允许采用动态适应恢复过程的一体化评估和恢复方法。此外，将 IQA 和 IR 紧密集成到统一模型中，可以最大限度地减少通常因保持两个模块不相交而产生的延迟和信息丢失（例如在图像和/或文本解码期间）。大量的实验表明，我们的方法在单一、未知和复合退化下取得了一致的改进，从而建立了新的最先进技术。

Title: SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Authors: Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26481
Pdf URL: https://arxiv.org/pdf/2603.26481
Copy Paste: [[2603.26481]] SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras(https://arxiv.org/abs/2603.26481)
Keywords: generative
Abstract: High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.
摘要：高质量 4D 重建可实现动态现实世界的逼真且身临其境的渲染。然而，与单个摄像机可以完全捕获的静态场景不同，高质量的动态场景通常需要数十甚至数百个同步摄像机的密集阵列。对如此昂贵的实验室设置的依赖严重限制了实际的可扩展性。为此，我们提出了一种稀疏相机动态重建框架，该框架利用丰富但不一致的生成观察。我们的关键创新是时空扭曲场，它提供了一个统一的机制，用于对空间和时间维度上的生成观察中的不一致进行建模。在此基础上，我们开发了一个完整的管道，可以从稀疏且未校准的相机输入中进行 4D 重建。我们在多摄像机动态场景基准上评估我们的方法，实现时空一致的高保真渲染，并显着优于现有方法。

Title: Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays

Authors: Martin Rath, Morteza Ghahremani, Yitong Li, Ashkan Taghipour, Marcus Makowski, Christian Wachinger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26509
Pdf URL: https://arxiv.org/pdf/2603.26509
Copy Paste: [[2603.26509]] Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays(https://arxiv.org/abs/2603.26509)
Keywords: super-resolution
Abstract: Computed tomography (CT) provides rich 3D anatomical details but is often constrained by high radiation exposure, substantial costs, and limited availability. While standard chest X-rays are cost-effective and widely accessible, they only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays offers a transformative solution to increase diagnostic accessibility, yet existing methods predominantly rely on synthetic X-ray projections, limiting clinical generalization. In this work, we propose AXON, a multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. AXON employs a coarse-to-fine strategy, with a Brownian Bridge diffusion model-based initial stage for global structural synthesis, followed by a ControlNet-based refinement stage for local intensity optimization. It also supports bi-planar X-ray input to mitigate depth ambiguities inherent in 2D-to-3D reconstruction. A super-resolution network is integrated to upscale the generated volumes to achieve diagnostic-grade resolution. Evaluations on both public and external datasets demonstrate that AXON significantly outperforms state-of-the-art baselines, achieving an 11.9% improvement in PSNR and an 11.0% increase in SSIM with robust generalizability across disparate clinical distributions. Our code is available at this https URL.
摘要：计算机断层扫描 (CT) 可提供丰富的 3D 解剖细节，但通常受到高辐射暴露、高昂成本和有限可用性的限制。虽然标准胸部 X 光检查具有成本效益且广泛使用，但它们仅提供具有有限病理信息的 2D 投影。从 2D X 射线重建 3D CT 体积提供了一种革命性的解决方案，可提高诊断的可及性，但现有方法主要依赖于合成 X 射线投影，限制了临床推广。在这项工作中，我们提出了 AXON，这是一种基于多级扩散的框架，可直接从真实 X 射线重建高保真 3D CT 体积。 AXON 采用从粗到细的策略，采用基于布朗桥扩散模型的初始阶段进行全局结构综合，然后是基于 ControlNet 的细化阶段进行局部强度优化。它还支持双平面 X 射线输入，以减轻 2D 到 3D 重建中固有的深度模糊性。集成超分辨率网络以升级生成的体积，以实现诊断级分辨率。对公共和外部数据集的评估表明，AXON 的性能显着优于最先进的基线，PSNR 提高了 11.9%，SSIM 提高了 11.0%，在不同的临床分布中具有强大的通用性。我们的代码可以在这个 https URL 上找到。

Title: AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing

Authors: Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Liu, Yuan Liu, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26546
Pdf URL: https://arxiv.org/pdf/2603.26546
Copy Paste: [[2603.26546]] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing(https://arxiv.org/abs/2603.26546)
Keywords: generative
Abstract: Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.
摘要：生成视频模型显着提高了自动驾驶恶劣天气的真实感合成；然而，他们始终需要大量数据集来了解罕见的天气情况。虽然 3D 感知编辑方法通过增强现有视频片段来缓解这些数据限制，但它们从根本上受到成本高昂的每场景优化的瓶颈，并且受到固有的几何和照明纠缠的影响。在这项工作中，我们介绍了 AutoWeather4D，这是一种前馈 3D 感知天气编辑框架，旨在显式解耦几何体和照明。我们方法的核心是 G 缓冲区双通道编辑机制。几何通道利用显式结构基础来实现表面锚定的物理交互，而光通道则通过分析方式解决光传输问题，将局部光源的贡献累积到全局照明中，以实现动态 3D 局部重新照明。大量实验表明，AutoWeather4D 实现了与生成基线相当的真实感和结构一致性，同时实现了细粒度参数化物理控制，可作为自动驾驶的实用数据引擎。

Title: HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Authors: Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26553
Pdf URL: https://arxiv.org/pdf/2603.26553
Copy Paste: [[2603.26553]] HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching(https://arxiv.org/abs/2603.26553)
Keywords: generation
Abstract: While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, leading to learning rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain crossmodal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
摘要：虽然协同语音手势生成领域取得了重大进展，但生成整体的、基于语义的手势仍然是一个挑战。现有的方法依赖于外部语义检索方法，由于依赖于预定义的语言规则，这限制了它们的泛化能力。基于流程匹配的方法产生了有希望的结果；然而，该网络仅使用语义一致的样本进行优化，而不接触负面示例，从而导致学习有节奏的手势而不是稀疏的运动，例如标志性和隐喻性的手势。此外，通过孤立地建模身体部位，大多数方法无法保持跨模式一致性。我们引入了一种基于对比流匹配的协同语音手势生成模型，该模型使用不匹配的音频文本条件作为负数，训练速度场遵循正确的运动轨迹，同时排斥语义不一致的轨迹。我们的模型通过余弦和对比目标将文本、音频和整体运动嵌入到复合潜在空间中，确保跨模式连贯性。大量的实验和用户研究表明，我们提出的方法在 BEAT2 和 SHOW 两个数据集上优于最先进的方法。

Title: Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

Authors: Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, Hiroshi Watanabe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26571
Pdf URL: https://arxiv.org/pdf/2603.26571
Copy Paste: [[2603.26571]] Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow(https://arxiv.org/abs/2603.26571)
Keywords: generation, generative
Abstract: Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies -- Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
摘要：现有的生成视频压缩方法仅使用生成模型作为传统编解码器之上的事后重建模块。我们提出了生成视频编解码器（GVC），这是一种零样本框架，它将预训练的视频生成模型转变为编解码器本身：传输的比特流直接指定生成解码轨迹，无需重新训练。为了实现这一点，我们在推理时将现代视频基础模型的确定性整流流 ODE 转换为等效的 SDE，从而解锁用于码本驱动压缩的每步随机注入点。在此统一的骨干基础上，我们实例化了三种互补的调节策略——具有自适应尾帧原子分配的图像到视频（I2V），在近零辅助信息下作为纯生成先验运行的文本到视频（T2V），以及具有用于双锚时间控制的边界共享 GOP 链接的首尾帧到视频（FLF2V）。总之，这些变体跨越了空间保真度、时间一致性和压缩效率之间的原则性权衡空间。标准基准测试表明，GVC 实现了低于 0.002 bpp 的高质量重建，同时通过单个超参数支持灵活的比特率控制。

Title: From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

Authors: Dávid Pukanec, Tibor Kubík, Michal Španěl
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26588
Pdf URL: https://arxiv.org/pdf/2603.26588
Copy Paste: [[2603.26588]] From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion(https://arxiv.org/abs/2603.26588)
Keywords: restoration, generation
Abstract: We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: this https URL
摘要：我们提出了 ToothCraft，一种基于扩散的模型，用于牙冠的上下文生成，并在人工创建的不完整牙齿上进行训练。基于 3D 形状条件扩散模型的最新进展，我们开发了一种能够根据局部解剖环境自动完成牙冠的模型。为了解决此任务缺乏训练数据的问题，我们设计了一个增强管道，可以从公开的完整牙弓数据集（3DS、ODD）生成不完整的牙齿几何形状。通过综合一组不同的训练示例，我们的方法可以对各种牙齿缺陷进行稳健的学习。实验结果表明，我们的模型重建完整牙冠的强大能力，在综合损坏的测试修复体上实现了 81.8% 的交叉交集 (IoU) 和 0.00034 的倒角距离 (CD)。我们的实验表明，该模型可以直接应用于现实世界的案例，有效地填充不完整的牙齿，同时生成的牙冠与相对牙列的交叉最小化，从而降低了咬合干扰的风险。可以通过以下网址访问代码、模型权重和数据集信息：此 https URL
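
For readers unfamiliar with the two reported metrics, a minimal sketch of how IoU and Chamfer Distance are commonly computed may help. It assumes occupied voxels given as coordinate tuples and the symmetric squared-distance form of the Chamfer Distance; conventions differ (squared or not, summed or averaged), and the abstract does not state which one the authors use. The function names are illustrative.

```python
def voxel_iou(pred, target) -> float:
    """Intersection over union of two sets of occupied voxel coordinates."""
    pred, target = set(pred), set(target)
    union = pred | target
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(pred & target) / len(union)


def _sq_dist(a, b) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(points_a, points_b) -> float:
    """Symmetric Chamfer Distance using mean squared nearest-neighbour distances.

    Brute force, O(len(a) * len(b)); fine for a small illustration only.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(_sq_dist(a, b) for b in points_b) for a in points_a) / len(points_a)
    b_to_a = sum(min(_sq_dist(b, a) for a in points_a) for b in points_b) / len(points_b)
    return a_to_b + b_to_a


pred_voxels = {(0, 0, 0), (1, 0, 0)}
true_voxels = {(1, 0, 0), (2, 0, 0)}
print(voxel_iou(pred_voxels, true_voxels))
print(chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]))
```

Higher IoU and lower Chamfer Distance both indicate a closer match between the generated and the reference crown.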

Title: Characterization and forecasting of national-scale solar power ramp events

Authors: Luca Lanzilao, Angela Meyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.26596
Pdf URL: https://arxiv.org/pdf/2603.26596
Copy Paste: [[2603.26596]] Characterization and forecasting of national-scale solar power ramp events(https://arxiv.org/abs/2603.26596)
Keywords: generation
Abstract: The rapid growth of solar energy is reshaping power system operations and increasing the complexity of grid management. As photovoltaic (PV) capacity expands, short-term fluctuations in PV generation introduce substantial operational uncertainty. At the same time, solar power ramp events intensify risks of grid instability and unplanned outages due to sudden large power fluctuations. Accurate identification, forecasting and mitigation of solar ramp events are therefore critical to maintaining grid stability. In this study, we analyze two years of PV power production from 6434 PV stations at 15-minute resolution. We develop quantitative metrics to define solar ramp events and systematically characterize their occurrence, frequency, and magnitude at a national scale. Furthermore, we examine the meteorological drivers of ramp events, highlighting the role of mesoscale cloud systems. In particular, we observe that ramp-up events are typically associated with cloud dissipation during the morning, while ramp-down events commonly occur when cloud cover increases in the afternoon. Additionally, we adopt a recently developed spatiotemporal forecasting framework to evaluate both deterministic and probabilistic PV power forecasts derived from deep learning and physics-based models, including SolarSTEPS, SHADECast, IrradianceNet, and IFS-ENS. The results show that SHADECast is the most reliable model, achieving a CRPS 10.8% lower than that of SolarSTEPS at a two-hour lead time. Nonetheless, state-of-the-art nowcasting models struggle to capture ramp dynamics, with forecast RMSE increasing by up to 50% compared to normal operating conditions. Overall, these results emphasize the need for improved high-resolution spatiotemporal modelling to enhance ramp prediction skill and support the reliable integration of large-scale solar generation into power systems.
摘要：太阳能的快速增长正在重塑电力系统的运行并增加电网管理的复杂性。随着光伏发电容量的扩大，光伏发电的短期波动会带来巨大的运营不确定性。与此同时，太阳能发电爬坡事件加剧了电网不稳定和突然大幅电力波动导致意外停电的风险。因此，准确识别、预测和缓解太阳能斜坡事件对于维持电网稳定性至关重要。在本研究中，我们以 15 分钟的分辨率分析了 6434 个光伏电站两年的光伏发电量。我们开发定量指标来定义太阳斜坡事件，并在全国范围内系统地描述其发生、频率和强度。此外，我们研究了斜坡事件的气象驱动因素，强调了中尺度云系统的作用。特别是，我们观察到上升事件通常与上午的云消散有关，而下降事件通常发生在下午云量增加时。此外，我们采用最近开发的时空预测框架来评估源自深度学习和基于物理的模型（包括 SolarSTEPS、SHADECast、IrradianceNet 和 IFS-ENS）的确定性和概率性光伏发电预测。结果表明，SHADECast 是最可靠的模型，在两小时的交付时间内，CRPS 比 SolarSTEPS 低 10.8%。尽管如此，最先进的临近预报模型仍难以捕捉斜坡动态，与正常运行条件相比，预测 RMSE 增加高达 50%。总体而言，这些结果强调需要改进高分辨率时空建模，以增强斜坡预测技能并支持大规模太阳能发电与电力系统的可靠集成。
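
One simple way to flag ramp events in a 15-minute power series is to threshold the change in capacity-normalised output over a fixed window. The sketch below only illustrates that generic idea: the window length, the threshold, and the function name are assumptions, and the paper defines its own quantitative metrics, which the abstract does not spell out.

```python
def detect_ramps(power, capacity, window=4, threshold=0.1):
    """Flag ramp events in a power time series.

    power:     sequence of power readings, one per 15-minute step
    capacity:  installed capacity in the same unit, used for normalisation
    window:    number of steps over which the change is measured (4 = 1 hour)
    threshold: minimum normalised change that counts as a ramp (assumed value)

    Returns a list of (start_index, direction, normalised_change) tuples.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    events = []
    for i in range(len(power) - window):
        delta = (power[i + window] - power[i]) / capacity
        if delta >= threshold:
            events.append((i, "up", delta))
        elif delta <= -threshold:
            events.append((i, "down", delta))
    return events


# Toy series in MW for a hypothetical 100 MW fleet: flat, sharp rise, then a drop.
series = [10, 10, 10, 10, 60, 60, 60, 60, 20]
for start, direction, change in detect_ramps(series, capacity=100, threshold=0.3):
    print(start, direction, round(change, 2))
```

Overlapping windows report the same physical ramp several times; a real analysis would merge consecutive detections into a single event.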

Title: VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Authors: Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26599
Pdf URL: https://arxiv.org/pdf/2603.26599
Copy Paste: [[2603.26599]] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward(https://arxiv.org/abs/2603.26599)
Keywords: generation
Abstract: Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
摘要：大规模视频扩散模型实现了令人印象深刻的视觉质量，但往往无法保持几何一致性。先前的方法通过使用附加模块增强生成器或应用几何感知对齐来提高一致性。然而，架构修改可能会损害互联网规模预训练模型的泛化，而现有的对齐方法仅限于静态场景，并依赖于需要重复 VAE 解码的 RGB 空间奖励，从而产生大量的计算开销，并且无法泛化到高度动态的现实世界场景。为了保留预训练能力，同时提高几何一致性，我们提出了 VGGRPO（视觉几何 GRPO），这是一种用于几何感知视频后期训练的潜在几何引导框架。 VGGRPO 引入了潜在几何模型 (LGM)，它将视频扩散潜在模型缝合到几何基础模型，从而能够从潜在空间直接解码场景几何模型。通过从具有 4D 重建能力的几何模型构建 LGM，VGGRPO 自然地扩展到动态场景，克服了现有方法的静态场景限制。在此基础上，我们执行潜在空间组相对策略优化，并提供两种互补的奖励：惩罚抖动轨迹的相机运动平滑度奖励，以及强制跨视图几何一致性的几何重投影一致性奖励。静态和动态基准测试表明，VGGRPO 提高了相机稳定性、几何一致性和整体质量，同时消除了昂贵的 VAE 解码，使潜在空间几何引导增强成为一种高效、灵活的方法来生成世界一致的视频。
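
The group-relative part of Group Relative Policy Optimization can be sketched in a few lines: each sample's reward is standardised against the other samples generated for the same prompt. This is the generic GRPO advantage, not the authors' code; in particular, the weighted sum used to combine the two rewards is a hypothetical choice, since the abstract does not say how they are combined.

```python
from statistics import mean, pstdev


def combined_reward(smoothness, reprojection, w_smooth=0.5, w_reproj=0.5):
    """Hypothetical weighted sum of the two reward terms named in the abstract."""
    return w_smooth * smoothness + w_reproj * reprojection


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four generated videos with made-up reward values.
smooth = [0.9, 0.4, 0.7, 0.2]
reproj = [0.8, 0.5, 0.6, 0.1]
rewards = [combined_reward(s, g) for s, g in zip(smooth, reproj)]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Samples that beat their group average get a positive advantage and are reinforced; no separate value network is needed, which is the main practical appeal of the group-relative formulation.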

Title: Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

Authors: Ruixing Zhang, Hanzhang Jiang, Leilei Sun, Liangzhe Han, Jibin Wang, Weifeng Lv
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26610
Pdf URL: https://arxiv.org/pdf/2603.26610
Copy Paste: [[2603.26610]] Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling(https://arxiv.org/abs/2603.26610)
Keywords: generation
Abstract: Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers), which limits their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on a map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.

Title: GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Authors: Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26661
Pdf URL: https://arxiv.org/pdf/2603.26661
Copy Paste: [[2603.26661]] GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation(https://arxiv.org/abs/2603.26661)
Keywords: generation, generative
Abstract: Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
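
Note on "controllable sampling via temperature": the abstract does not give the sampling procedure, so the following is only a minimal sketch of generic temperature-scaled next-token sampling, not GaussianGPT's actual implementation. The function names (`softmax_with_temperature`, `sample_token`) and the toy logits are assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities.

    Hypothetical helper: lower temperatures sharpen the distribution,
    higher temperatures flatten it toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=1.0, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not model output
    for t in (0.1, 1.0, 10.0):
        print(t, softmax_with_temperature(toy_logits, t))
```

In an autoregressive generator of this kind, a call such as `sample_token(logits, temperature)` would be repeated once per generated token, so a single scalar trades off fidelity (low temperature) against diversity (high temperature).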