2024-12-05

Title: DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America

Authors: Daniel Seal, Rossella Arcucci, Salva Rühling-Cachay, César Quilodrán-Casas
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02723
Pdf URL: https://arxiv.org/pdf/2412.02723
Copy Paste: [[2412.02723]] DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America(https://arxiv.org/abs/2412.02723)
Keywords: generative
Abstract: Climate change is increasing the frequency of extreme precipitation events, making weather disasters such as flooding and landslides more likely. The ability to accurately nowcast precipitation is therefore becoming more critical for safeguarding society by providing immediate, accurate information to decision makers. Motivated by the recent success of generative models at precipitation nowcasting, this paper: extends the DYffusion framework to this task and evaluates its performance at forecasting IMERG satellite precipitation data up to a 4-hour horizon; modifies the DYffusion framework to improve its ability to model rainfall data; and introduces a novel loss function that combines MSE, MAE and the LPIPS perceptual score. In a quantitative evaluation of forecasts up to a 4-hour horizon, the modified DYffusion framework trained with the novel loss outperforms four competitor models. It has the highest CSI scores for weak, moderate, and heavy rain thresholds and retains an LPIPS score $<$ 0.2 for the entire roll-out, degrading the least as lead-time increases. The proposed nowcasting model demonstrates visually stable and sharp forecasts up to a 2-hour horizon on a heavy rain case study. Code is available at this https URL.
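
The CSI (Critical Success Index) scores mentioned in the abstract above can be illustrated with a minimal sketch. It uses the standard definition, hits / (hits + misses + false alarms), over thresholded rain rates. The function name `csi`, the sample values, and the 1.0 threshold are assumptions for illustration and are not taken from the paper or its code.

```python
def csi(forecast, observed, threshold):
    """Critical Success Index for one rain-rate threshold.

    An "event" is any value at or above the threshold.
    CSI = hits / (hits + misses + false alarms).
    """
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event = f >= threshold
        o_event = o >= threshold
        if f_event and o_event:
            hits += 1          # rain forecast and rain observed
        elif o_event:
            misses += 1        # rain observed but not forecast
        elif f_event:
            false_alarms += 1  # rain forecast but not observed
    denom = hits + misses + false_alarms
    # With no events at all, CSI is undefined.
    return hits / denom if denom else float("nan")


# Hypothetical flattened pixel values (illustrative only).
forecast = [0.0, 1.2, 3.0, 0.4, 5.0]
observed = [0.0, 2.0, 0.2, 1.5, 6.0]
score = csi(forecast, observed, threshold=1.0)
```

Correct negatives (no rain forecast, none observed) do not enter the score, which is why CSI is commonly used for sparse events such as heavy rain.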

Title: Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications

Authors: Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Þorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, Joao Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang, Srija Chakraborty, Sizhe Wang, Ankur Kumar, Myscon Truong, Denys Godwin, Hyunho Lee, Chia-Yu Hsu, Ata Akbari Asanjan, Besart Mujeci, Trevor Keenan, Paulo Arevalo, Wenwen Li, Hamed Alemohammad, Pontus Olofsson, Christopher Hain, Robert Kennedy, Bianca Zadrozny, Gabriele Cavallaro, Campbell Watson, Manil Maskey, Rahul Ramachandran, Juan Bernabe Moreno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02732
Pdf URL: https://arxiv.org/pdf/2412.02732
Copy Paste: [[2412.02732]] Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications(https://arxiv.org/abs/2412.02732)
Keywords: foundation model
Abstract: This technical report presents Prithvi-EO-2.0, a new geospatial foundation model that offers significant improvements over its predecessor, Prithvi-EO-1.0. Trained on 4.2M global time series samples from NASA's Harmonized Landsat and Sentinel-2 data archive at 30m resolution, the new 300M and 600M parameter models incorporate temporal and location embeddings for enhanced performance across various geospatial tasks. Through extensive benchmarking with GEO-Bench, the 600M version outperforms the previous Prithvi-EO model by 8% across a range of tasks. It also outperforms six other geospatial foundation models when benchmarked on remote sensing tasks from different domains and resolutions (i.e. from 0.1m to 15m). The results demonstrate the versatility of the model in both classical earth observation and high-resolution applications. Early involvement of end-users and subject matter experts (SMEs) was among the key factors that contributed to the project's success. In particular, SME involvement allowed for constant feedback on model and dataset design, as well as successful customization for diverse SME-led applications in disaster response, land use and crop mapping, and ecosystem dynamics monitoring. Prithvi-EO-2.0 is available on Hugging Face and IBM terratorch, with additional resources on GitHub. The project exemplifies the Trusted Open Science approach embraced by all involved organizations.

Title: Mixture of Physical Priors Adapter for Parameter-Efficient Fine-Tuning

Authors: Zhaozhi Wang, Conghu Li, Qixiang Ye, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02759
Pdf URL: https://arxiv.org/pdf/2412.02759
Copy Paste: [[2412.02759]] Mixture of Physical Priors Adapter for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2412.02759)
Keywords: diffusion
Abstract: Most parameter-efficient fine-tuning (PEFT) methods rely on low-rank representations to adapt models. However, these approaches often oversimplify representations, particularly when the underlying data has high-rank or high-frequency components. This limitation hinders the model's ability to capture complex data interactions effectively. In this paper, we propose a novel approach that models network weights by leveraging a combination of physical priors, enabling more accurate approximations. We use three foundational equations -- heat diffusion, wave propagation, and Poisson's steady-state equation -- each contributing distinctive modeling properties: heat diffusion enforces local smoothness, wave propagation facilitates long-range interactions, and Poisson's equation captures global equilibrium. To combine these priors effectively, we introduce the Mixture of Physical Priors Adapter (MoPPA), using an efficient Discrete Cosine Transform (DCT) implementation. To dynamically balance these priors, a route regularization mechanism is designed to adaptively tune their contributions. MoPPA serves as a lightweight, plug-and-play module that seamlessly integrates into transformer architectures, with adaptable complexity depending on the local context. Specifically, using MAE pre-trained ViT-B, MoPPA improves PEFT accuracy by up to 2.1% on VTAB-1K image classification with a comparable number of trainable parameters, and advantages are further validated through experiments across various vision backbones, showcasing MoPPA's effectiveness and adaptability. The code will be made publicly available.

Title: Grayscale to Hyperspectral at Any Resolution Using a Phase-Only Lens

Authors: Dean Hazineh, Federico Capasso, Todd Zickler
Subjects: cs.CV, eess.IV, physics.optics
Abstract URL: https://arxiv.org/abs/2412.02798
Pdf URL: https://arxiv.org/pdf/2412.02798
Copy Paste: [[2412.02798]] Grayscale to Hyperspectral at Any Resolution Using a Phase-Only Lens(https://arxiv.org/abs/2412.02798)
Keywords: diffusion
Abstract: We consider the problem of reconstructing a $H\times W\times 31$ hyperspectral image from a $H\times W$ grayscale snapshot measurement that is captured using a single diffractive optic and a filterless panchromatic photosensor. This problem is severely ill-posed, and we present the first model that is able to produce high-quality results. We train a conditional denoising diffusion model that maps a small grayscale measurement patch to a hyperspectral patch. We then deploy the model to many patches in parallel, using global physics-based guidance to synchronize the patch predictions. Our model can be trained using small hyperspectral datasets and then deployed to reconstruct hyperspectral images of arbitrary size. Also, by drawing multiple samples with different seeds, our model produces useful uncertainty maps. We show that our model achieves state-of-the-art performance on previous snapshot hyperspectral benchmarks where reconstruction is better conditioned. Our work lays the foundation for a new class of high-resolution hyperspectral imagers that are compact and light-efficient.

Title: Minimization of Boolean Complexity in In-Context Concept Learning

Authors: Leroy Z. Wang, R. Thomas McCoy, Shane Steinert-Threlkeld
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02823
Pdf URL: https://arxiv.org/pdf/2412.02823
Copy Paste: [[2412.02823]] Minimization of Boolean Complexity in In-Context Concept Learning(https://arxiv.org/abs/2412.02823)
Keywords: in-context
Abstract: What factors contribute to the relative success and corresponding difficulties of in-context learning for Large Language Models (LLMs)? Drawing on insights from the literature on human concept learning, we test LLMs on carefully designed concept learning tasks, and show that task performance highly correlates with the Boolean complexity of the concept. This suggests that in-context learning exhibits a learning bias for simplicity in a way similar to humans.

Title: Effortless Efficiency: Low-Cost Pruning of Diffusion Models

Authors: Yang Zhang, Er Jin, Yanfei Dong, Ashkan Khakzar, Philip Torr, Johannes Stegmaier, Kenji Kawaguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02852
Pdf URL: https://arxiv.org/pdf/2412.02852
Copy Paste: [[2412.02852]] Effortless Efficiency: Low-Cost Pruning of Diffusion Models(https://arxiv.org/abs/2412.02852)
Keywords: diffusion
Abstract: Diffusion models have achieved impressive advancements in various vision tasks. However, these gains often rely on increasing model size, which escalates computational complexity and memory demands, complicating deployment, raising inference costs, and causing environmental impact. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to retain the model performance. Retraining a modern large diffusion model is extremely costly and resource-intensive, which limits the practicality of these methods. In this work, we achieve low-cost diffusion pruning without retraining by proposing a model-agnostic structural pruning framework for diffusion models that learns a differentiable mask to sparsify the model. To ensure effective pruning that preserves the quality of the final denoised latent, we design a novel end-to-end pruning objective that spans the entire diffusion process. As end-to-end pruning is memory-intensive, we further propose time step gradient checkpointing, a technique that significantly reduces memory usage during optimization, enabling end-to-end pruning within a limited memory budget. Results on state-of-the-art U-Net diffusion models SDXL and diffusion transformers (FLUX) demonstrate that our method can effectively prune up to 20% parameters with minimal perceptible performance degradation, and notably, without the need for model retraining. We also showcase that our method can still prune on top of time step distilled diffusion models.

Title: MAGMA: Manifold Regularization for MAEs

Authors: Alin Dondera, Anuj Singh, Hadi Jamali-Rad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02871
Pdf URL: https://arxiv.org/pdf/2412.02871
Copy Paste: [[2412.02871]] MAGMA: Manifold Regularization for MAEs(https://arxiv.org/abs/2412.02871)
Keywords: self-supervised
Abstract: Masked Autoencoders (MAEs) are an important divide in self-supervised learning (SSL) due to their independence from augmentation techniques for generating positive (and/or negative) pairs as in contrastive frameworks. Their masking and reconstruction strategy also nicely aligns with SSL approaches in natural language processing. Most MAEs are built upon Transformer-based architectures where visual features are not regularized as opposed to their convolutional neural network (CNN) based counterparts, which can potentially hinder their performance. To address this, we introduce MAGMA, a novel batch-wide layer-wise regularization loss applied to representations of different Transformer layers. We demonstrate that by plugging in the proposed regularization loss, one can significantly improve the performance of MAE-based models. We further demonstrate the impact of the proposed loss on optimizing other generic SSL approaches (such as VICReg and SimCLR), broadening the impact of the proposed approach. Our code base can be found at this https URL.

Title: GUESS: Generative Uncertainty Ensemble for Self Supervision

Authors: Salman Mohamadi, Gianfranco Doretto, Donald A. Adjeroh
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.02896
Pdf URL: https://arxiv.org/pdf/2412.02896
Copy Paste: [[2412.02896]] GUESS: Generative Uncertainty Ensemble for Self Supervision(https://arxiv.org/abs/2412.02896)
Keywords: self-supervised, generative
Abstract: Self-supervised learning (SSL) frameworks consist of a pretext task and a loss function that aim to learn useful general features from unlabeled data. The basic idea of most SSL baselines revolves around enforcing invariance to a variety of data augmentations via the loss function. However, one main issue is that inattentive or deterministic enforcement of invariance to any kind of data augmentation is generally not only inefficient but also potentially detrimental to performance on downstream tasks. In this work, we investigate the issue from the viewpoint of uncertainty in invariance representation. Uncertainty representation is fairly under-explored in the design of SSL architectures as well as loss functions. We incorporate uncertainty representation in both the loss function and the architecture design, aiming for more data-dependent invariance enforcement. The former is represented in the form of data-derived uncertainty in the SSL loss function, resulting in a generative-discriminative loss function. The latter is achieved by feeding slightly different distorted versions of samples to the ensemble, aiming to learn better and more robust representations. Specifically, building upon the recent methods that use hard and soft whitening (a.k.a. redundancy reduction), we introduce a new approach, GUESS, a pseudo-whitening framework composed of controlled uncertainty injection, a new architecture, and a new loss function. We include detailed results and ablation analysis establishing GUESS as a new baseline.

Title: Panoptic Diffusion Models: co-generation of images and segmentation maps

Authors: Yinghan Long, Kaushik Roy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02929
Pdf URL: https://arxiv.org/pdf/2412.02929
Copy Paste: [[2412.02929]] Panoptic Diffusion Models: co-generation of images and segmentation maps(https://arxiv.org/abs/2412.02929)
Keywords: diffusion
Abstract: Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.

Title: Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution

Authors: Jiahua Xiao, Jiawei Zhang, Dongqing Zou, Xiaodan Zhang, Jimmy Ren, Xing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02960
Pdf URL: https://arxiv.org/pdf/2412.02960
Copy Paste: [[2412.02960]] Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution(https://arxiv.org/abs/2412.02960)
Keywords: diffusion
Abstract: Real-world image super-resolution (Real-ISR) has achieved a remarkable leap by leveraging large-scale text-to-image models, enabling realistic image restoration from given recognition textual prompts. However, these methods sometimes fail to recognize some salient objects, resulting in inaccurate semantic restoration in these regions. Additionally, the same region may have a strong response to more than one prompt and it will lead to semantic ambiguity for image super-resolution. To alleviate the above two issues, in this paper, we propose to consider semantic segmentation as an additional control condition into diffusion-based image super-resolution. Compared to textual prompt conditions, semantic segmentation enables a more comprehensive perception of salient objects within an image by assigning class labels to each pixel. It also mitigates the risks of semantic ambiguities by explicitly allocating objects to their respective spatial regions. In practice, inspired by the fact that image super-resolution and segmentation can benefit each other, we propose SegSR which introduces a dual-diffusion framework to facilitate interaction between the image super-resolution and segmentation diffusion models. Specifically, we develop a Dual-Modality Bridge module to enable updated information flow between these two diffusion models, achieving mutual benefit during the reverse diffusion process. Extensive experiments show that SegSR can generate realistic images while preserving semantic structures more effectively.

Title: Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

Authors: XiuYu Zhang, Zening Luo, Michelle E. Lu
Subjects: cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2412.02962
Pdf URL: https://arxiv.org/pdf/2412.02962
Copy Paste: [[2412.02962]] Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference(https://arxiv.org/abs/2412.02962)
Keywords: diffusion
Abstract: Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases. The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations and thus have become a significant bottleneck. This limitation is more salient for applications that are interactive in nature or require small latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step. PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices. As a result, PCPP decreases the communication cost by around $70\%$ compared to DistriFusion (the state of the art implementation of PP) and achieves $2.36\sim 8.02\times$ inference speed-up using $4\sim 8$ GPUs compared to $2.32\sim 6.71\times$ achieved by DistriFusion depending on the computing device configuration and resolution of generation at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.

Title: CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D Design Datasets

Authors: XiuYu Zhang, Xiaolei Ye, Jui-Che Chang, Yue Fang
Subjects: cs.CV, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2412.02996
Pdf URL: https://arxiv.org/pdf/2412.02996
Copy Paste: [[2412.02996]] CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D Design Datasets(https://arxiv.org/abs/2412.02996)
Keywords: generative
Abstract: Three-dimensional (3D) objects have wide applications. Despite the growing interest in 3D modeling in academia and industry, designing and/or creating 3D objects from scratch remains time-consuming and challenging. With the development of generative artificial intelligence (AI), designers have discovered a new way to create images for ideation. However, generative AI is less useful for creating 3D objects of satisfactory quality. To allow 3D designers to access a wide range of 3D objects for creative activities based on their specific demands, we propose a machine learning (ML) enhanced framework CLAS - named after the four steps of capture, label, associate, and search - to enable fully automatic retrieval of 3D objects based on user specifications, leveraging existing datasets of 3D objects. CLAS provides an effective and efficient method for any person or organization to benefit from their existing but unused 3D datasets. In addition, CLAS may also be used to produce high-quality 3D object synthesis datasets for training and evaluating 3D generative models. As a proof of concept, we created and showcased a search system with a web user interface (UI) for retrieving 6,778 3D objects of chairs in the ShapeNet dataset powered by CLAS. In a closed-set retrieval setting, our retrieval method achieves a mean reciprocal rank (MRR) of 0.58, top 1 accuracy of 42.27%, and top 10 accuracy of 89.64%.
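
The retrieval metrics reported in the abstract above (mean reciprocal rank and top-k accuracy) follow standard definitions, sketched below. The function names and the toy query results are hypothetical and are not drawn from the CLAS system or the ShapeNet evaluation.

```python
def reciprocal_rank(ranked_ids, target_id):
    """Return 1/rank of the target in the ranked list, or 0.0 if absent."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == target_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked_ids, target_id) pairs."""
    return sum(reciprocal_rank(r, t) for r, t in results) / len(results)


def top_k_accuracy(results, k):
    """Fraction of queries whose target appears in the first k results."""
    return sum(t in r[:k] for r, t in results) / len(results)


# Hypothetical queries: each pair is (ranked retrieval list, correct item).
results = [
    (["a", "b", "c"], "a"),  # target at rank 1
    (["b", "a", "c"], "a"),  # target at rank 2
    (["c", "b", "a"], "a"),  # target at rank 3
    (["x", "y", "z"], "a"),  # target not retrieved
]
mrr = mean_reciprocal_rank(results)
top1 = top_k_accuracy(results, k=1)
top2 = top_k_accuracy(results, k=2)
```

MRR rewards placing the correct item near the top of the list, while top-k accuracy only records whether it appears within the first k results.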

Title: AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

Authors: Shouwei Ruan, Hanqin Liu, Yao Huang, Xiaoqi Wang, Caixin Kang, Hang Su, Yinpeng Dong, Xingxing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03002
Pdf URL: https://arxiv.org/pdf/2412.03002
Copy Paste: [[2412.03002]] AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?(https://arxiv.org/abs/2412.03002)
Keywords: generative
Abstract: Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework that generates physically reproducible adversarial 3D transformation (Adv-3DT) samples from single-view images. AdvDreamer integrates advanced generative techniques with two key innovations and aims to characterize the worst-case distributions of 3D variations from natural images. To ensure adversarial effectiveness and method generality, we introduce an Inverse Semantic Probability Objective that executes adversarial optimization on fundamental vision-text alignment spaces, which can be generalizable across different VLM architectures and downstream tasks. To mitigate the distribution discrepancy between generated and real-world samples while maintaining physical reproducibility, we design a Naturalness Reward Model that provides regularization feedback during adversarial optimization, preventing convergence towards hallucinated and unnatural elements. Leveraging AdvDreamer, we establish MM3DTBench, the first VQA dataset for benchmarking VLMs' 3D variations robustness. Extensive evaluations on representative VLMs with diverse architectures highlight that 3D variations in the real world may pose severe threats to model performance across various tasks.

Title: Human Multi-View Synthesis from a Single-View Model:Transferred Body and Face Representations

Authors: Yu Feng, Shunsi Zhang, Jian Shu, Hanfeng Zhao, Guoliang Pang, Chi Zhang, Hao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03011
Pdf URL: https://arxiv.org/pdf/2412.03011
Copy Paste: [[2412.03011]] Human Multi-View Synthesis from a Single-View Model:Transferred Body and Face Representations(https://arxiv.org/abs/2412.03011)
Keywords: diffusion
Abstract: Generating multi-view human images from a single view is a complex and significant challenge. Although recent advancements in multi-view object generation have shown impressive results with diffusion models, novel view synthesis for humans remains constrained by the limited availability of 3D human datasets. Consequently, many existing models struggle to produce realistic human body shapes or capture fine-grained facial details accurately. To address these issues, we propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis. Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation, aiming to extend the 2D knowledge of the single-view model to a multi-view diffusion model. Additionally, to enhance the model's detail restoration capability, we integrate transferred multimodal facial features into our trained human diffusion model. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.

Title: Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

Authors: Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03017
Pdf URL: https://arxiv.org/pdf/2412.03017
Copy Paste: [[2412.03017]] Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach(https://arxiv.org/abs/2412.03017)
Keywords: diffusion
Abstract: Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process, struggling to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, so there is demand for an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the $\ell_2$-loss for pixel-level regression, and the other is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, PiSA-SR can offer flexible SR results according to user preference without re-training. Codes and models can be found at this https URL.

Title: Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection

Authors: Xiaofeng Tan, Hongsong Wang, Xin Geng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03044
Pdf URL: https://arxiv.org/pdf/2412.03044
Copy Paste: [[2412.03044]] Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection(https://arxiv.org/abs/2412.03044)
Keywords: diffusion, anomaly
Abstract: Video anomaly detection is an essential yet challenging open-set task in computer vision, often addressed by leveraging reconstruction as a proxy task. However, existing reconstruction-based methods encounter challenges in two main aspects: (1) limited model robustness for open-set scenarios, and (2) an overemphasis on, but restricted capacity for, detailed motion reconstruction. To this end, we propose a novel frequency-guided diffusion model with perturbation training, which enhances the model robustness by perturbation training and emphasizes the principal motion components guided by motion frequencies. Specifically, we first use a trainable generator to produce perturbative samples for perturbation training of the diffusion model. During the perturbation training phase, the model robustness is enhanced and the domain of the reconstructed model is broadened by training against this generator. Subsequently, perturbative samples are introduced for inference, which impacts the reconstruction of normal and abnormal motions differentially, thereby enhancing their separability. Considering that motion details originate from high-frequency information, we propose a masking method based on 2D discrete cosine transform to separate high-frequency information and low-frequency information. Guided by the high-frequency information from observed motion, the diffusion model can focus on generating low-frequency information, and thus reconstructing the motion accurately. Experimental results on five video anomaly detection datasets, including human-related and open-set benchmarks, demonstrate the effectiveness of the proposed method. Our code is available at this https URL.
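
The idea of separating low- and high-frequency information with a 2D discrete cosine transform mask, as described in the abstract above, can be sketched in plain Python. This is only an illustration under stated assumptions: the orthonormal DCT-II basis, the mask rule `u + v < cutoff`, and the toy 3x4 input are choices made here and are not taken from the paper or its released code.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (row k is frequency k)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def dct2(x):
    """2D DCT: transform along rows and columns, Y = C_m X C_n^T."""
    c_m, c_n = dct_matrix(len(x)), dct_matrix(len(x[0]))
    return matmul(matmul(c_m, x), transpose(c_n))


def idct2(y):
    """Inverse 2D DCT; the basis is orthonormal, so X = C_m^T Y C_n."""
    c_m, c_n = dct_matrix(len(y)), dct_matrix(len(y[0]))
    return matmul(matmul(transpose(c_m), y), c_n)


def split_frequencies(x, cutoff):
    """Split x into low- and high-frequency parts with a DCT-domain mask.

    Assumed mask rule: coefficient (u, v) counts as low frequency
    when u + v < cutoff.
    """
    coeffs = dct2(x)
    low = [[c if u + v < cutoff else 0.0 for v, c in enumerate(row)]
           for u, row in enumerate(coeffs)]
    high = [[c if u + v >= cutoff else 0.0 for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]
    return idct2(low), idct2(high)


# Hypothetical 3x4 input standing in for a small motion array.
x = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 3.0, 4.0, 5.0],
     [0.0, 1.0, 0.0, 1.0]]
low, high = split_frequencies(x, cutoff=2)
```

Because the two masks are complementary and the transform is linear, `low` and `high` sum back to the input, so either part can be held fixed while the other is regenerated.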

Title: UTSD: Unified Time Series Diffusion Model

Authors: Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03068
Pdf URL: https://arxiv.org/pdf/2412.03068
Copy Paste: [[2412.03068]] UTSD: Unified Time Series Diffusion Model(https://arxiv.org/abs/2412.03068)
Keywords: diffusion, foundation model
Abstract: Transformer-based architectures have achieved unprecedented success in time series analysis. However, when facing the challenge of cross-domain modeling, existing approaches that use statistical priors as prompt engineering fail under the large distribution shift among domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful probability distribution modeling ability of diffusion. Unlike autoregressive models that capture the conditional probabilities of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly via conditional sampling. The proposed UTSD contains three pivotal designs: (1) a condition network that captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network to generate the prediction sequence; (2) an adapter-based fine-tuning strategy, in which the multi-domain universal representation learned in the pretraining stage is utilized for downstream tasks in target domains; (3) a diffusion and denoising process on the actual sequence space, combined with the improved classifier free guidance as the conditional generation strategy, which greatly improves the stability and accuracy of the downstream task. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. The empirical results validate the potential of UTSD as a time series foundational model.

Title: Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model

Authors: Joonyong Park, Daisuke Saito, Nobuaki Minematsu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.03074
Pdf URL: https://arxiv.org/pdf/2412.03074
Copy Paste: [[2412.03074]] Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model(https://arxiv.org/abs/2412.03074)
Keywords: self-supervised
Abstract: We examine the text-free speech representations of raw audio obtained from a self-supervised learning (SSL) model by analyzing the synthesized speech using the SSL representations instead of conventional text representations. Since raw audio does not have paired speech representations as transcribed texts do, obtaining speech representations from unpaired speech is crucial for augmenting available datasets for speech synthesis. Specifically, the proposed speech synthesis is conducted using discrete symbol representations from the SSL model in comparison with text representations, and analytical examinations of the synthesized speech have been carried out. The results empirically show that using text representations is advantageous for preserving semantic information, while using discrete symbol representations is superior for preserving acoustic content, including prosodic and intonational information.

Title: Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

Authors: Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03079
Pdf URL: https://arxiv.org/pdf/2412.03079
Copy Paste: [[2412.03079]] Align3R: Aligned Monocular Depth Estimation for Dynamic Videos(https://arxiv.org/abs/2412.03079)
Keywords: diffusion
Abstract: Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with performance superior to baseline methods.

Title: Mimir: Improving Video Diffusion Models for Precise Text Understanding

Authors: Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03085
Pdf URL: https://arxiv.org/pdf/2412.03085
Copy Paste: [[2412.03085]] Mimir: Improving Video Diffusion Models for Precise Text Understanding(https://arxiv.org/abs/2412.03085)
Keywords: diffusion
Abstract: Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offer three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from superior scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: this https URL

Title: MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction

Authors: Gangjian Zhang, Nanjie Yao, Shunsi Zhang, Hanfeng Zhao, Guoliang Pang, Jian Shu, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03103
Pdf URL: https://arxiv.org/pdf/2412.03103
Copy Paste: [[2412.03103]] MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction(https://arxiv.org/abs/2412.03103)
Keywords: diffusion, generative
Abstract: This paper investigates the research task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we effectively integrate the projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations to improve joint depth estimation during training, and refine the coarse human wrinkles in a manner resembling the denoising process of a diffusion model. Extensive quantitative and qualitative experiments on two out-of-distribution test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods.

Title: Few-Shot Learning with Adaptive Weight Masking in Conditional GANs

Authors: Jiacheng Hu, Zhen Qi, Jianjun Wei, Jiajing Chen, Runyuan Bao, Xinyu Qiu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03105
Pdf URL: https://arxiv.org/pdf/2412.03105
Copy Paste: [[2412.03105]] Few-Shot Learning with Adaptive Weight Masking in Conditional GANs(https://arxiv.org/abs/2412.03105)
Keywords: generative
Abstract: Deep learning has revolutionized various fields, yet its efficacy is hindered by overfitting and the requirement of extensive annotated data, particularly in few-shot learning scenarios where limited samples are available. This paper introduces a novel approach to few-shot learning by employing a Residual Weight Masking Conditional Generative Adversarial Network (RWM-CGAN) for data augmentation. The proposed model integrates residual units within the generator to enhance network depth and sample quality, coupled with a weight mask regularization technique in the discriminator to improve feature learning from small-sample categories. This method addresses the core issues of robustness and generalization in few-shot learning by providing a controlled and clear augmentation of the sample space. Extensive experiments demonstrate that RWM-CGAN not only expands the sample space effectively but also enriches the diversity and quality of generated samples, leading to significant improvements in detection and classification accuracy on public datasets. The paper contributes to the advancement of few-shot learning by offering a practical solution to the challenges posed by data scarcity and the need for rapid generalization to new tasks or categories.

Title: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis

Authors: Siyoon Jin, Jisu Nam, Jiyoung Kim, Dahyun Chung, Yeong-Seok Kim, Joonhyung Park, Heonjeong Chu, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03150
Pdf URL: https://arxiv.org/pdf/2412.03150
Copy Paste: [[2412.03150]] Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis(https://arxiv.org/abs/2412.03150)
Keywords: diffusion
Abstract: Exemplar-based semantic image synthesis aims to generate images aligned with given semantic content while preserving the appearance of an exemplar image. Conventional structure-guidance models, such as ControlNet, are limited in that they cannot directly utilize exemplar images as input, relying instead solely on text prompts to control appearance. Recent tuning-free approaches address this limitation by transferring local appearance from the exemplar image to the synthesized image through implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, these methods face challenges when applied to content-rich scenes with significant geometric deformations, such as driving scenes. In this paper, we propose the Appearance Matching Adapter (AM-Adapter), a learnable framework that enhances cross-image matching within augmented self-attention by incorporating semantic information from segmentation maps. To effectively disentangle generation and matching processes, we adopt a stage-wise training approach. Initially, we train the structure-guidance and generation networks, followed by training the AM-Adapter while keeping the other networks frozen. During inference, we introduce an automated exemplar retrieval method to efficiently select exemplar image-segmentation pairs. Despite utilizing a limited number of learnable parameters, our method achieves state-of-the-art performance, excelling in both semantic alignment preservation and local appearance fidelity. Extensive ablation studies further validate our design choices. Code and pre-trained weights will be publicly available: this https URL

Title: PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

Authors: Qihan Huang, Long Chan, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03177
Pdf URL: https://arxiv.org/pdf/2412.03177
Copy Paste: [[2412.03177]] PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation(https://arxiv.org/abs/2412.03177)
Keywords: self-supervised
Abstract: Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images during test-time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO that estimates the quality of image patches within each generated image and accordingly trains the model. To this end, PatchDPO first leverages the pre-trained vision model with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experiment results demonstrate that PatchDPO significantly improves the performance of multiple pre-trained personalized generation models, and achieves state-of-the-art performance on both single-object and multi-object personalized image generation. Our code is available at this https URL.

Title: Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

Authors: Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński, Marek Śmieja, Bartosz Zieliński
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03215
Pdf URL: https://arxiv.org/pdf/2412.03215
Copy Paste: [[2412.03215]] Beyond [cls]: Exploring the true potential of Masked Image Modeling representations(https://arxiv.org/abs/2412.03215)
Keywords: self-supervised
Abstract: Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations. However, for high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than Joint-Embedding Architectures (JEA) - another prominent SSL paradigm. To understand this performance gap, we analyze the information flow in Vision Transformers (ViT) learned by both approaches. We reveal that whereas JEAs construct their representation on a selected set of relevant image fragments, MIM models aggregate nearly the whole image content. Moreover, we demonstrate that MIM-trained ViTs retain valuable information within their patch tokens, which is not effectively captured by the global [cls] token representations. Therefore, selective aggregation of relevant patch tokens, without any fine-tuning, results in consistently higher-quality MIM representations. To our knowledge, we are the first to highlight the lack of effective representation aggregation as an emergent issue of MIM and propose directions to address it, contributing to future advances in Self-Supervised Learning.

Title: MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers

Authors: Xiaohe Ma, Valentin Deschaintre, Miloš Hašan, Fujun Luan, Kun Zhou, Hongzhi Wu, Yiwei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03225
Pdf URL: https://arxiv.org/pdf/2412.03225
Copy Paste: [[2412.03225]] MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers(https://arxiv.org/abs/2412.03225)
Keywords: diffusion
Abstract: High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.

Title: DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

Authors: Qingdong He, Jinlong Peng, Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Yong Liu, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03255
Pdf URL: https://arxiv.org/pdf/2412.03255
Copy Paste: [[2412.03255]] DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation(https://arxiv.org/abs/2412.03255)
Keywords: diffusion
Abstract: To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a fixed number of conditions, which does not fully address the complexity of multiple conditions and their potential conflicts. This underscores the need for innovative approaches to manage multiple conditions effectively for more reliable and detailed image synthesis. To address this issue, we propose a novel framework, DynamicControl, which supports dynamic combinations of diverse control signals, allowing adaptive selection of different numbers and types of conditions. Our approach begins with a double-cycle controller that generates an initial real score sorting for all input conditions by leveraging pre-trained conditional generation models and discriminative models. This controller evaluates the similarity between extracted conditions and input conditions, as well as the pixel-level similarity with the source image. Then, we integrate a Multimodal Large Language Model (MLLM) to build an efficient condition evaluator. This evaluator optimizes the ordering of conditions based on the double-cycle controller's score ranking. Our method jointly optimizes MLLMs and diffusion models, utilizing MLLMs' reasoning capabilities to facilitate multi-condition text-to-image (T2I) tasks. The final sorted conditions are fed into a parallel multi-control adapter, which learns feature maps from dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over generated images. Through both quantitative and qualitative comparisons, DynamicControl demonstrates its superiority over existing methods in terms of controllability, generation quality and composability under various conditional controls.

Title: RFSR: Improving ISR Diffusion Models via Reward Feedback Learning

Authors: Xiaopeng Sun, Qinwei Lin, Yu Gao, Yujie Zhong, Chengjian Feng, Dengjie Li, Zheng Zhao, Jie Hu, Lin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03268
Pdf URL: https://arxiv.org/pdf/2412.03268
Copy Paste: [[2412.03268]] RFSR: Improving ISR Diffusion Models via Reward Feedback Learning(https://arxiv.org/abs/2412.03268)
Keywords: diffusion, generative
Abstract: Generative diffusion models (DM) have been extensively utilized in image super-resolution (ISR). Most of the existing methods adopt the denoising loss from DDPMs for model optimization. We posit that introducing reward feedback learning to finetune the existing models can further improve the quality of the generated images. In this paper, we propose a timestep-aware training strategy with reward feedback learning. Specifically, in the initial denoising stages of ISR diffusion, we apply low-frequency constraints to super-resolution (SR) images to maintain structural stability. In the later denoising stages, we use reward feedback learning to improve the perceptual and aesthetic quality of the SR images. In addition, we incorporate Gram-KL regularization to alleviate stylization caused by reward hacking. Our method can be integrated into any diffusion-based ISR model in a plug-and-play manner. Experiments show that ISR diffusion models, when fine-tuned with our method, significantly improve the perceptual and aesthetic quality of SR images, achieving excellent subjective results. Code: this https URL

Title: Intent-driven In-context Learning for Few-shot Dialogue State Tracking

Authors: Zihao Yi, Zhe Xu, Ying Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03270
Pdf URL: https://arxiv.org/pdf/2412.03270
Copy Paste: [[2412.03270]] Intent-driven In-context Learning for Few-shot Dialogue State Tracking(https://arxiv.org/abs/2412.03270)
Keywords: in-context
Abstract: Dialogue state tracking (DST) plays an essential role in task-oriented dialogue systems. However, the user's input may contain implicit information, posing significant challenges for DST tasks. Additionally, DST data includes complex information, which not only contains a large amount of noise unrelated to the current turn, but also makes constructing DST datasets expensive. To address these challenges, we introduce Intent-driven In-context Learning for Few-shot DST (IDIC-DST). By extracting the user's intent, we propose an Intent-driven Dialogue Information Augmentation module to augment the dialogue information, which can track dialogue states more effectively. Moreover, we mask noisy information from DST data and rewrite the user's input in the Intent-driven Examples Retrieval module, where we retrieve similar examples. We then utilize a pre-trained large language model to update the dialogue state using the augmented dialogue information and examples. Experimental results demonstrate that IDIC-DST achieves state-of-the-art performance in few-shot settings on the MultiWOZ 2.1 and MultiWOZ 2.4 datasets.

Title: AntLM: Bridging Causal and Masked Language Models

Authors: Xinru Yu, Bin Guo, Shiwei Luo, Jie Wang, Tao Ji, Yuanbin Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.03275
Pdf URL: https://arxiv.org/pdf/2412.03275
Copy Paste: [[2412.03275]] AntLM: Bridging Causal and Masked Language Models(https://arxiv.org/abs/2412.03275)
Keywords: foundation model
Abstract: Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two mainstream learning paradigms based on Transformer networks, specifically the Decoder-only and Encoder-only architectures. The strengths of each paradigm in downstream tasks have shown a mix of advantages and disadvantages. In the past BabyLM Challenge 2023, although the MLM paradigm achieved the best average performance, the CLM paradigm demonstrated significantly faster convergence rates. For the BabyLM Challenge 2024, we propose a novel language modeling paradigm named $\textbf{AntLM}$, which integrates both CLM and MLM to leverage the advantages of these two classic paradigms. We chose the strict-small track and conducted experiments on two foundation models: BabyLlama, representing CLM, and LTG-BERT, representing MLM. During the training process for specific foundation models, we alternate between applying CLM or MLM training objectives and causal or bidirectional attention masks. Experimental results show that combining the two pretraining objectives leverages their strengths, enhancing overall training performance. Under the same epochs, $AntLM_{BabyLlama}$ improves Macro-average by 1%, and $AntLM_{LTG-BERT}$ achieves a 2.2% increase over the baselines.

Title: Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models

Authors: Andreas Müller, Denis Lukovnikov, Jonas Thietke, Asja Fischer, Erwin Quiring
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.03283
Pdf URL: https://arxiv.org/pdf/2412.03283
Copy Paste: [[2412.03283]] Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models(https://arxiv.org/abs/2412.03283)
Keywords: diffusion
Abstract: Integrating watermarking into the generation process of latent diffusion models (LDMs) simplifies detection and attribution of generated content. Semantic watermarks, such as Tree-Rings and Gaussian Shading, represent a novel class of watermarking techniques that are easy to implement and highly robust against various perturbations. However, our work demonstrates a fundamental security vulnerability of semantic watermarks. We show that attackers can leverage unrelated models, even with different latent spaces and architectures (UNet vs DiT), to perform powerful and realistic forgery attacks. Specifically, we design two watermark forgery attacks. The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM to get closer to the latent representation of a watermarked image. We also show that this technique can be used for watermark removal. The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt. Both attacks just need a single reference image with the target watermark. Overall, our findings question the applicability of semantic watermarks by revealing that attackers can easily forge or remove these watermarks under realistic conditions.

Title: Equivariant Representation Learning for Augmentation-based Self-Supervised Learning via Image Reconstruction

Authors: Qin Wang, Kai Krajsek, Hanno Scharr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03314
Pdf URL: https://arxiv.org/pdf/2412.03314
Copy Paste: [[2412.03314]] Equivariant Representation Learning for Augmentation-based Self-Supervised Learning via Image Reconstruction(https://arxiv.org/abs/2412.03314)
Keywords: self-supervised, foundation model
Abstract: Augmentation-based self-supervised learning methods have shown remarkable success in self-supervised visual representation learning, excelling in learning invariant features but often neglecting equivariant ones. This limitation reduces the generalizability of foundation models, particularly for downstream tasks requiring equivariance. We propose integrating an image reconstruction task as an auxiliary component in augmentation-based self-supervised learning algorithms to facilitate equivariant feature learning without additional parameters. Our method implements a cross-attention mechanism to blend features learned from two augmented views, subsequently reconstructing one of them. This approach is adaptable to various datasets and augmented-pair based learning methods. We evaluate its effectiveness on learning equivariant features through multiple linear regression tasks and downstream applications on both artificial (3DIEBench) and natural (ImageNet) datasets. Results consistently demonstrate significant improvements over standard augmentation-based self-supervised learning methods and state-of-the-art approaches, particularly excelling in scenarios involving combined augmentations. Our method enhances the learning of both invariant and equivariant features, leading to more robust and generalizable visual representations for computer vision tasks.

Title: Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis

Authors: Tao Jun Lin, Wenqing Wang, Yujiao Shi, Akhil Perincherry, Ankit Vora, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03315
Pdf URL: https://arxiv.org/pdf/2412.03315
Copy Paste: [[2412.03315]] Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis(https://arxiv.org/abs/2412.03315)
Keywords: diffusion
Abstract: This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa. We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively. Unlike previous works that typically focus on one-to-one generation, producing a single output image from a single input image, our approach acknowledges the inherent one-to-many nature of the problem. This recognition stems from the challenges posed by differences in illumination, weather conditions, and occlusions between the two views. To effectively model this uncertainty, we leverage recent advancements in diffusion models. Specifically, we exploit random Gaussian noise to represent the diverse possibilities learnt from the target view data. We introduce a Geometry-guided Cross-view Condition (GCC) strategy to establish explicit geometric correspondences between satellite and street-view features. This enables us to resolve the geometry ambiguity introduced by camera pose between image pairs, boosting the performance of cross-view image synthesis. Through extensive quantitative and qualitative analyses on three benchmark cross-view datasets, we demonstrate the superiority of our proposed geometry-guided cross-view condition over baseline methods, including recent state-of-the-art approaches in cross-view image synthesis. Our method generates images of higher quality, fidelity, and diversity than other state-of-the-art approaches.

Title: UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection

Authors: Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03342
Pdf URL: https://arxiv.org/pdf/2412.03342
Copy Paste: [[2412.03342]] UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection(https://arxiv.org/abs/2412.03342)
Keywords: foundation model, anomaly
Abstract: Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often follow a "one-category-one-model" paradigm, requiring large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, such as industrial, logical, and medical anomalies, with a training-free unified model. UniVAD needs only a few normal samples as references during testing to detect anomalies in previously unseen objects, without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering ($C^3$) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments on nine datasets spanning industrial, logical, and medical fields, and the results demonstrate that UniVAD achieves state-of-the-art performance in few-shot anomaly detection tasks across multiple domains, outperforming domain-specific anomaly detection models. The code will be made publicly available.

Title: DIVE: Taming DINO for Subject-Driven Video Editing

Authors: Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03347
Pdf URL: https://arxiv.org/pdf/2412.03347
Copy Paste: [[2412.03347]] DIVE: Taming DINO for Subject-Driven Video Editing(https://arxiv.org/abs/2412.03347)
Keywords: diffusion
Abstract: Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject's identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. Project page: this https URL

Title: Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification

Authors: Alexandre Fournier-Montgieux, Michael Soumm, Adrian Popescu, Bertrand Luvison, Hervé Le Borgne
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03349
Pdf URL: https://arxiv.org/pdf/2412.03349
Copy Paste: [[2412.03349]] Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification(https://arxiv.org/abs/2412.03349)
Keywords: generative
Abstract: Face recognition and verification are two computer vision tasks whose performances have advanced with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive nature of face data and biases in real-world training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems remain. Using the existing DCFace SOTA framework, we introduce a new controlled generation pipeline that improves fairness. Through classical fairness metrics and a proposed in-depth statistical analysis based on logit models and ANOVA, we show that our generation pipeline improves fairness more than other bias mitigation approaches while slightly improving raw performance.

Title: TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

Authors: Qinwei Lin, Xiaopeng Sun, Yu Gao, Yujie Zhong, Dengjie Li, Zheng Zhao, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03355
Pdf URL: https://arxiv.org/pdf/2412.03355
Copy Paste: [[2412.03355]] TASR: Timestep-Aware Diffusion Model for Image Super-Resolution(https://arxiv.org/abs/2412.03355)
Keywords: diffusion
Abstract: Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: this https URL

Title: Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment

Authors: Feng He, Chao Zhang, Zhixue Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03400
Pdf URL: https://arxiv.org/pdf/2412.03400
Copy Paste: [[2412.03400]] Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment(https://arxiv.org/abs/2412.03400)
Keywords: diffusion
Abstract: Implicit assumptions and priors are often necessary in text-to-image generation tasks, especially when textual prompts lack sufficient context. However, these assumptions can sometimes reflect outdated concepts, inaccuracies, or societal bias embedded in the training data. We present Embedding-only Editing (Embedit), a method designed to efficiently adjust implicit assumptions and priors in the model without affecting its interpretation of unrelated objects or overall performance. Given a "source" prompt (e.g., "rose") that elicits an implicit assumption (e.g., rose is red) and a "destination" prompt that specifies the desired attribute (e.g., "blue rose"), Embedit fine-tunes only the word token embedding (WTE) of the target object ("rose") to optimize the last hidden state of the text encoder in Stable Diffusion, a SOTA text-to-image model. This targeted adjustment prevents unintended effects on other objects in the model's knowledge base, as the WTEs for unrelated objects and the model weights remain unchanged. Consequently, when a prompt does not contain the edited object, all representations and model outputs are identical to those of the original, unedited model. Our method is highly efficient, modifying only 768 parameters for Stable Diffusion 1.4 and 2048 for XL in a single edit, matching the WTE dimension of each respective model. This minimal scope, combined with rapid execution, makes Embedit highly practical for real-world applications. Additionally, changes are easily reversible by restoring the original WTE layers. Our experimental results demonstrate that Embedit consistently outperforms previous methods across various models, tasks, and editing scenarios (both single and sequential multiple edits), achieving at least a 6.01% improvement (from 87.17% to 93.18%).

Title: Skel3D: Skeleton Guided Novel View Synthesis

Authors: Aron Fóthi, Bence Fazekas, Natabara Máté Gyöngyössy, Kristian Fenech
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03407
Pdf URL: https://arxiv.org/pdf/2412.03407
Copy Paste: [[2412.03407]] Skel3D: Skeleton Guided Novel View Synthesis(https://arxiv.org/abs/2412.03407)
Keywords: diffusion, generative
Abstract: In this paper, we present an approach for monocular open-set novel view synthesis (NVS) that leverages object skeletons to guide the underlying diffusion model. Building upon a baseline that utilizes a pre-trained 2D image generator, our method takes advantage of the Objaverse dataset, which includes animated objects with bone structures. By introducing a skeleton guide layer following the existing ray conditioning normalization (RCN) layer, our approach enhances pose accuracy and multi-view consistency. The skeleton guide layer provides detailed structural information for the generative model, improving the quality of synthesized views. Experimental results demonstrate that our skeleton-guided method significantly enhances consistency and accuracy across diverse object categories within the Objaverse dataset. Our method outperforms existing state-of-the-art NVS techniques both quantitatively and qualitatively, without relying on explicit 3D representations.

Title: Assessing Foundation Models' Transferability to Physiological Signals in Precision Medicine

Authors: Matthias Christenson, Cove Geary, Brian Locke, Pranav Koirala, Warren Woodrich Pettine
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.03427
Pdf URL: https://arxiv.org/pdf/2412.03427
Copy Paste: [[2412.03427]] Assessing Foundation Models' Transferability to Physiological Signals in Precision Medicine(https://arxiv.org/abs/2412.03427)
Keywords: foundation model
Abstract: The success of precision medicine requires computational models that can effectively process and interpret diverse physiological signals across heterogeneous patient populations. While foundation models have demonstrated remarkable transfer capabilities across various domains, their effectiveness in handling individual-specific physiological signals - crucial for precision medicine - remains largely unexplored. This work introduces a systematic pipeline for rapidly and efficiently evaluating foundation models' transfer capabilities in medical contexts. Our pipeline employs a three-stage approach. First, it leverages physiological simulation software to generate diverse, clinically relevant scenarios, particularly focusing on data-scarce medical conditions. This simulation-based approach enables both targeted capability assessment and subsequent model fine-tuning. Second, the pipeline projects these simulated signals through the foundation model to obtain embeddings, which are then evaluated using linear methods. This evaluation quantifies the model's ability to capture three critical aspects: physiological feature independence, temporal dynamics preservation, and medical scenario differentiation. Finally, the pipeline validates these representations through specific downstream medical tasks. Initial testing of our pipeline on the Moirai time series foundation model revealed significant limitations in physiological signal processing, including feature entanglement, temporal dynamics distortion, and reduced scenario discrimination. These findings suggest that current foundation models may require substantial architectural modifications or targeted fine-tuning before deployment in clinical settings.

Title: SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

Authors: Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo
Subjects: cs.CV, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2412.03430
Pdf URL: https://arxiv.org/pdf/2412.03430
Copy Paste: [[2412.03430]] SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model(https://arxiv.org/abs/2412.03430)
Keywords: diffusion, generative
Abstract: Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the effectiveness of existing talking face video generation models when applied to singing. We observe that the differences between singing and talking audio manifest in terms of frequency and amplitude. To address this, we have designed a multi-scale spectral module to help the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that aids the model in learning the human behaviors associated with singing audio. These two modules are integrated into the diffusion model to enhance singing video generation performance, resulting in our proposed model, SINGER. Furthermore, the lack of high-quality real-world singing face videos has hindered the development of the singing video generation community. To address this gap, we have collected an in-the-wild audio-visual singing dataset to facilitate research in this area. Our experiments demonstrate that SINGER is capable of generating vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.

Title: CleanDIFT: Diffusion Features without Noise

Authors: Nick Stracke, Stefan Andreas Baumann, Kolja Bauer, Frank Fundel, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03439
Pdf URL: https://arxiv.org/pdf/2412.03439
Copy Paste: [[2412.03439]] CleanDIFT: Diffusion Features without Noise(https://arxiv.org/abs/2412.03439)
Keywords: diffusion
Abstract: Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

Title: State Frequency Estimation for Anomaly Detection

Authors: Clinton Cao, Agathe Blaise, Annibale Panichella, Sicco Verwer
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.03442
Pdf URL: https://arxiv.org/pdf/2412.03442
Copy Paste: [[2412.03442]] State Frequency Estimation for Anomaly Detection(https://arxiv.org/abs/2412.03442)
Keywords: anomaly
Abstract: Many works have studied the efficacy of state machines for detecting anomalies within NetFlows. These works typically learn a model from unlabeled data and compute anomaly scores for arbitrary traces based on their likelihood of occurrence or how well they fit within the model. However, these methods do not dynamically adapt their scores based on the traces seen at test time. This becomes a problem when an adversary produces seemingly common traces in their attack, causing the model to miss the detection by assigning low anomaly scores. We propose SEQUENT, a new approach that uses the state visit frequency to adapt its scoring for anomaly detection dynamically. SEQUENT subsequently uses the scores to generate root causes for anomalies. These allow the grouping of alarms and simplify the analysis of anomalies. Our evaluation of SEQUENT on three NetFlow datasets indicates that our approach outperforms existing methods, demonstrating its effectiveness in detecting anomalies.

Title: Pre-trained Multiple Latent Variable Generative Models are good defenders against Adversarial Attacks

Authors: Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03453
Pdf URL: https://arxiv.org/pdf/2412.03453
Copy Paste: [[2412.03453]] Pre-trained Multiple Latent Variable Generative Models are good defenders against Adversarial Attacks(https://arxiv.org/abs/2412.03453)
Keywords: foundation model, generative
Abstract: Attackers can deliberately perturb classifiers' input with subtle noise, altering final predictions. Among proposed countermeasures, adversarial purification employs generative networks to preprocess input images, filtering out adversarial noise. In this study, we propose specific generators, termed Multiple Latent Variable Generative Models (MLVGMs), for adversarial purification. These models possess multiple latent variables that naturally disentangle coarse from fine features. Taking advantage of these properties, we autoencode images to maintain class-relevant information, while discarding and re-sampling any detail, including adversarial noise. The procedure is completely training-free, exploring the generalization abilities of pre-trained MLVGMs on the adversarial purification downstream task. Despite the lack of large models trained on billions of samples, we show that smaller MLVGMs are already competitive with traditional methods, and can be used as foundation models. Official code released at this https URL.

Title: Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective

Authors: Neta Shaul, Itai Gat, Marton Havasi, Daniel Severo, Anuroop Sriram, Peter Holderrieth, Brian Karrer, Yaron Lipman, Ricky T. Q. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03487
Pdf URL: https://arxiv.org/pdf/2412.03487
Copy Paste: [[2412.03487]] Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective(https://arxiv.org/abs/2412.03487)
Keywords: diffusion, generative
Abstract: The design space of discrete-space diffusion or flow generative models is significantly less well-understood than that of their continuous-space counterparts, with many works focusing only on a simple masked construction. In this work, we aim to take a holistic approach to the construction of discrete generative models based on continuous-time Markov chains, and for the first time, allow the use of arbitrary discrete probability paths, or colloquially, corruption processes. Through the lens of optimizing the symmetric kinetic energy, we propose velocity formulas that can be applied to any given probability path, completely decoupling the probability and velocity, and giving the user the freedom to specify any desirable probability path based on expert knowledge specific to the data domain. Furthermore, we find that a special construction of mixture probability paths optimizes the symmetric kinetic energy for the discrete case. We empirically validate the usefulness of this new design space across multiple modalities: text generation, inorganic material generation, and image generation. We find that we can outperform the mask construction even in text with kinetic-optimal mixture paths, while we can make use of domain-specific constructions of the probability path over the visual domain.

Title: Distillation of Diffusion Features for Semantic Correspondence

Authors: Frank Fundel, Johannes Schusterbauer, Vincent Tao Hu, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03512
Pdf URL: https://arxiv.org/pdf/2412.03512
Copy Paste: [[2412.03512]] Distillation of Diffusion Features for Semantic Correspondence(https://arxiv.org/abs/2412.03512)
Keywords: diffusion, foundation model, generative
Abstract: Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge by proposing a more computationally efficient approach based on a novel knowledge distillation technique. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.

Title: Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

Authors: Shengyuan Zhang, An Zhao, Ling Yang, Zejian Li, Chenye Meng, Haoran Xu, Tianrun Chen, AnYang Wei, Perry Pengyun GU, Lingyun Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03515
Pdf URL: https://arxiv.org/pdf/2412.03515
Copy Paste: [[2412.03515]] Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion(https://arxiv.org/abs/2412.03515)
Keywords: diffusion
Abstract: Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel Structural Loss, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly reduces the completion time from 30.55 to 5.37 seconds per frame ($>$5$\times$) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models. Our code is publicly available at this https URL.

Title: NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

Authors: Lingen Li, Zhaoyang Zhang, Yaowei Li, Jiale Xu, Xiaoyu Li, Wenbo Hu, Weihao Cheng, Jinwei Gu, Tianfan Xue, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03517
Pdf URL: https://arxiv.org/pdf/2412.03517
Copy Paste: [[2412.03517]] NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images(https://arxiv.org/abs/2412.03517)
Keywords: diffusion, generative
Abstract: Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems.

Title: Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Authors: Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03520
Pdf URL: https://arxiv.org/pdf/2412.03520
Copy Paste: [[2412.03520]] Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention(https://arxiv.org/abs/2412.03520)
Keywords: diffusion
Abstract: Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at this https URL.

Title: NODE-AdvGAN: Improving the transferability and perceptual similarity of adversarial examples by dynamic-system-driven adversarial generative model

Authors: Xinheng Xie, Yue Wu, Cuiyu He
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03539
Pdf URL: https://arxiv.org/pdf/2412.03539
Copy Paste: [[2412.03539]] NODE-AdvGAN: Improving the transferability and perceptual similarity of adversarial examples by dynamic-system-driven adversarial generative model(https://arxiv.org/abs/2412.03539)
Keywords: generative
Abstract: Understanding adversarial examples is crucial for improving the model's robustness, as they introduce imperceptible perturbations that deceive models. Effective adversarial examples, therefore, offer the potential to train more robust models by removing their singularities. We propose NODE-AdvGAN, a novel approach that treats adversarial generation as a continuous process and employs a Neural Ordinary Differential Equation (NODE) for simulating the dynamics of the generator. By mimicking the iterative nature of traditional gradient-based methods, NODE-AdvGAN generates smoother and more precise perturbations that preserve high perceptual similarity when added to benign images. We also propose a new training strategy, NODE-AdvGAN-T, which enhances transferability in black-box attacks by effectively tuning noise parameters during training. Experiments demonstrate that NODE-AdvGAN and NODE-AdvGAN-T generate more effective adversarial examples that achieve higher attack success rates while preserving better perceptual quality than traditional GAN-based methods.

Title: MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Authors: Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03558
Pdf URL: https://arxiv.org/pdf/2412.03558
Copy Paste: [[2412.03558]] MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation(https://arxiv.org/abs/2412.03558)
Keywords: diffusion
Abstract: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.

Title: FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

Authors: Lue Fan, Hao Zhang, Qitai Wang, Hongsheng Li, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03566
Pdf URL: https://arxiv.org/pdf/2412.03566
Copy Paste: [[2412.03566]] FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes(https://arxiv.org/abs/2412.03566)
Keywords: generative
Abstract: We propose FreeSim, a camera simulation method for autonomous driving. FreeSim emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. At such viewpoints, previous methods suffer unacceptable degradation because training data for these viewpoints is unavailable. To address such data scarcity, we first propose a generative enhancement model with a matched data construction strategy. The resulting model can generate high-quality images at a viewpoint that deviates slightly from the recorded trajectories, conditioned on the degraded rendering of this viewpoint. We then propose a progressive reconstruction strategy, which progressively adds generated images of unrecorded views into the reconstruction process, starting from slightly off-trajectory viewpoints and moving progressively farther away. With this progressive generation-reconstruction pipeline, FreeSim supports high-quality off-trajectory view synthesis under large deviations of more than 3 meters.

Title: Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis

Authors: Qitao Zhao, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03570
Pdf URL: https://arxiv.org/pdf/2412.03570
Copy Paste: [[2412.03570]] Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis(https://arxiv.org/abs/2412.03570)
Keywords: generative
Abstract: Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks -- accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical framework of analysis by synthesis casts this inference as a joint optimization seeking to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) with gradient-descent-based pose refinement of initial pose estimates. However, given a sparse set of observed views, the observations may not provide sufficient direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easily corrected and can further degrade the inferred 3D. To allow robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts this analysis-by-synthesis approach by: a) including novel-view-synthesis-based generative priors in conjunction with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and using a discrete search with a continuous optimization-based strategy to correct them. We validate our framework across real-world and synthetic datasets in combination with several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the base systems' pose accuracy while yielding high-quality 3D reconstructions that outperform the results from current multi-view reconstruction baselines.

Title: Navigation World Models

Authors: Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.03572
Pdf URL: https://arxiv.org/pdf/2412.03572
Copy Paste: [[2412.03572]] Navigation World Models(https://arxiv.org/abs/2412.03572)
Keywords: diffusion
Abstract: Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.