2024-12-03

Title: DiffGuard: Text-Based Safety Checker for Diffusion Models

Authors: Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00064
Pdf URL: https://arxiv.org/pdf/2412.00064
Copy Paste: [[2412.00064]] DiffGuard: Text-Based Safety Checker for Diffusion Models(https://arxiv.org/abs/2412.00064)
Keywords: diffusion
Abstract: Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper first reveals their limitations and then presents a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving performance that surpasses the best existing filters by over 14%.

Title: Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions

Authors: Justin Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00073
Pdf URL: https://arxiv.org/pdf/2412.00073
Copy Paste: [[2412.00073]] Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions(https://arxiv.org/abs/2412.00073)
Keywords: diffusion, generative
Abstract: The rise of advanced AI models like Generative Adversarial Networks (GANs) and diffusion models such as Stable Diffusion has made the creation of highly realistic images accessible, posing risks of misuse in misinformation and manipulation. This study evaluates the effectiveness of convolutional neural networks (CNNs), as well as DenseNet architectures, for detecting AI-generated images. Using variations of the CIFAKE dataset, including images generated by different versions of Stable Diffusion, we analyze the impact of updates and modifications such as Gaussian blurring, prompt text changes, and Low-Rank Adaptation (LoRA) on detection accuracy. The findings highlight vulnerabilities in current detection methods and propose strategies to enhance the robustness and reliability of AI-image detection systems.

Title: Unpacking the Individual Components of Diffusion Policy

Authors: Xiu Yuan
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00084
Pdf URL: https://arxiv.org/pdf/2412.00084
Copy Paste: [[2412.00084]] Unpacking the Individual Components of Diffusion Policy(https://arxiv.org/abs/2412.00084)
Keywords: diffusion
Abstract: Imitation Learning presents a promising approach for learning generalizable and complex robotic skills. The recently proposed Diffusion Policy generates robot action sequences through a conditional denoising diffusion process, achieving state-of-the-art performance compared to other imitation learning methods. This paper summarizes five key components of Diffusion Policy: 1) observation sequence input; 2) action sequence execution; 3) receding horizon; 4) U-Net or Transformer network architecture; and 5) FiLM conditioning. By conducting experiments across ManiSkill and Adroit benchmarks, this study aims to elucidate the contribution of each component to the success of Diffusion Policy in various scenarios. We hope our findings will provide valuable insights for the application of Diffusion Policy in future research and industry.
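
Components 2) and 3) of the abstract above (action sequence execution and receding horizon) can be illustrated with a toy control loop. This is only a sketch under stated assumptions: `dummy_policy`, `rollout`, and every parameter value are hypothetical stand-ins, and the "environment" simply echoes the last action back as the next observation. It is not the paper's implementation.

```python
def dummy_policy(obs_history, horizon):
    """Hypothetical stand-in for a learned policy: predict `horizon` future actions."""
    last = obs_history[-1]
    return [last + i + 1 for i in range(horizon)]


def rollout(initial_obs, total_steps, obs_window=2, horizon=8, execute=4):
    """Receding-horizon loop: plan `horizon` actions, execute only the first
    `execute` of them, then re-plan from the most recent observations."""
    obs_history = [initial_obs]
    executed = []
    replans = 0
    while len(executed) < total_steps:
        # Observation sequence input: only the last `obs_window` observations are used.
        plan = dummy_policy(obs_history[-obs_window:], horizon)
        replans += 1
        for action in plan[:execute]:
            if len(executed) == total_steps:
                break
            executed.append(action)
            obs_history.append(action)  # toy environment: observation = last action
    return executed, replans


executed, replans = rollout(initial_obs=0, total_steps=10)
print(executed, replans)
```

Executing fewer actions per plan (`execute`) makes the loop react sooner to new observations, at the cost of more policy calls.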

Title: Graph Canvas for Controllable 3D Scene Generation

Authors: Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, Lei Li
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00091
Pdf URL: https://arxiv.org/pdf/2412.00091
Copy Paste: [[2412.00091]] Graph Canvas for Controllable 3D Scene Generation(https://arxiv.org/abs/2412.00091)
Keywords: in-context
Abstract: Spatial intelligence is foundational to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current methodologies for 3D scene generation often rely heavily on predefined datasets, and struggle to adapt dynamically to changing spatial relationships. In this paper, we introduce GraphCanvas3D, a programmable, extensible, and adaptable framework for controllable 3D scene generation. Leveraging in-context learning, GraphCanvas3D enables dynamic adaptability without the need for retraining, supporting flexible and customizable scene creation. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. Unlike conventional approaches, which are constrained in adaptability and often require predefined input masks or retraining for modifications, GraphCanvas3D allows for seamless object manipulation and scene adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene generation, incorporating temporal dynamics to model changes over time. Experimental results and user studies demonstrate that GraphCanvas3D enhances usability, flexibility, and adaptability for scene generation. Our code and models are available on the project website: this https URL.

Title: A Novel Approach to Image Steganography Using Generative Adversarial Networks

Authors: Waheed Rehman
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00094
Pdf URL: https://arxiv.org/pdf/2412.00094
Copy Paste: [[2412.00094]] A Novel Approach to Image Steganography Using Generative Adversarial Networks(https://arxiv.org/abs/2412.00094)
Keywords: generative
Abstract: The field of steganography has long been focused on developing methods to securely embed information within various digital media while ensuring imperceptibility and robustness. However, the growing sophistication of detection tools and the demand for increased data hiding capacity have revealed limitations in traditional techniques. In this paper, we propose a novel approach to image steganography that leverages the power of generative adversarial networks (GANs) to address these challenges. By employing a carefully designed GAN architecture, our method ensures the creation of stego-images that are visually indistinguishable from their original counterparts, effectively thwarting detection by advanced steganalysis tools. Additionally, the adversarial training paradigm optimizes the balance between embedding capacity, imperceptibility, and robustness, enabling more efficient and secure data hiding. We evaluate our proposed method through a series of experiments on benchmark datasets and compare its performance against baseline techniques, including least significant bit (LSB) substitution and discrete cosine transform (DCT)-based methods. Our results demonstrate significant improvements in metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and robustness against detection. This work not only contributes to the advancement of image steganography but also provides a foundation for exploring GAN-based approaches for secure digital communication.
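
For readers unfamiliar with PSNR, one of the metrics named in the abstract above, the standard definition is 10 * log10(MAX^2 / MSE). The sketch below applies it to a cover image and a stego copy. The function name `psnr` and the pixel values are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(cover, stego, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    if len(cover) != len(stego):
        raise ValueError("images must have the same number of pixels")
    mse = sum((c - s) ** 2 for c, s in zip(cover, stego)) / len(cover)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit pixels; the stego copy differs from the cover by 1 in two positions,
# as an LSB-substitution baseline might.
cover = [52, 55, 61, 66, 70, 61, 64, 73]
stego = [53, 55, 61, 67, 70, 61, 64, 73]
print(f"PSNR: {psnr(cover, stego):.2f} dB")
```

A higher PSNR means the stego-image is numerically closer to the cover image. It does not by itself measure resistance to steganalysis.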

Title: Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

Authors: Maitreya Patel, Song Wen, Dimitris N. Metaxas, Yezhou Yang
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00100
Pdf URL: https://arxiv.org/pdf/2412.00100
Copy Paste: [[2412.00100]] Steering Rectified Flow Models in the Vector Field for Controlled Image Generation(https://arxiv.org/abs/2412.00100)
Keywords: diffusion
Abstract: Diffusion models (DMs) excel in photorealism, image editing, and solving inverse problems, aided by classifier-free guidance and image inversion techniques. However, rectified flow models (RFMs) remain underexplored for these tasks. Existing DM-based methods often require additional training, lack generalization to pretrained latent models, underperform, and demand significant computational resources due to extensive backpropagation through ODE solvers and inversion processes. In this work, we first develop a theoretical and empirical understanding of the vector field dynamics of RFMs in efficiently guiding the denoising trajectory. Our findings reveal that we can navigate the vector field in a deterministic and gradient-free manner. Utilizing this property, we propose FlowChef, which leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. FlowChef is a unified framework for controlled image generation that, for the first time, simultaneously addresses classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. Finally, we perform extensive evaluations and show that FlowChef significantly outperforms baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. Project Page: this https URL.

Title: Differential learning kinetics govern the transition from memorization to generalization during in-context learning

Authors: Alex Nguyen, Gautam Reddy
Subjects: cs.LG, cond-mat.dis-nn, cs.AI, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2412.00104
Pdf URL: https://arxiv.org/pdf/2412.00104
Copy Paste: [[2412.00104]] Differential learning kinetics govern the transition from memorization to generalization during in-context learning(https://arxiv.org/abs/2412.00104)
Keywords: in-context
Abstract: Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explain the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.

Title: Demographic Predictability in 3D CT Foundation Embeddings

Authors: Guangyao Zheng, Michael A. Jacobs, Vishwa S. Parekh
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00110
Pdf URL: https://arxiv.org/pdf/2412.00110
Copy Paste: [[2412.00110]] Demographic Predictability in 3D CT Foundation Embeddings(https://arxiv.org/abs/2412.00110)
Keywords: self-supervised, foundation model
Abstract: Self-supervised foundation models have recently been successfully extended to encode three-dimensional (3D) computed tomography (CT) images, with excellent performance across several downstream tasks, such as intracranial hemorrhage detection and lung cancer risk forecasting. However, as self-supervised models learn from complex data distributions, questions arise concerning whether these embeddings capture demographic information, such as age, sex, or race. Using the National Lung Screening Trial (NLST) dataset, which contains 3D CT images and demographic data, we evaluated a range of classifiers: softmax regression, linear regression, linear support vector machine, random forest, and decision tree, to predict sex, race, and age of the patients in the images. Our results indicate that the embeddings effectively encoded age and sex information, with a linear regression model achieving a root mean square error (RMSE) of 3.8 years for age prediction and a softmax regression model attaining an AUC of 0.998 for sex classification. Race prediction was less effective, with an AUC of 0.878. These findings suggest a detailed exploration into the information encoded in self-supervised learning frameworks is needed to help ensure fair, responsible, and patient privacy-protected healthcare AI.

Title: SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Authors: Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00114
Pdf URL: https://arxiv.org/pdf/2412.00114
Copy Paste: [[2412.00114]] SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments(https://arxiv.org/abs/2412.00114)
Keywords: diffusion
Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.

Title: OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00115
Pdf URL: https://arxiv.org/pdf/2412.00115
Copy Paste: [[2412.00115]] OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation(https://arxiv.org/abs/2412.00115)
Keywords: diffusion
Abstract: Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. The source code and the dataset are available at: this https URL.

Title: Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Authors: Xuexiang Niu, Jinping Tang, Lei Wang, Ge Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00122
Pdf URL: https://arxiv.org/pdf/2412.00122
Copy Paste: [[2412.00122]] Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback(https://arxiv.org/abs/2412.00122)
Keywords: diffusion
Abstract: Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, because the feedback content lacks focus, especially regarding object type and quantity, these techniques struggle to accurately match text and images when faced with specified prompts. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, consisting of three stages. First, objects in images generated by the diffusion model are detected to obtain their categories and quantities. Meanwhile, the confidence of category and quantity can be derived from the detection results and given prompts. Next, we define a novel matching score, based on the above confidence, to measure text-image alignment. It can guide the model for feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward function gradients to generate semantically related images. Unlike previous feedback methods that focus more on overall matching, we place more emphasis on the accuracy of entity categories and quantities. In addition, we construct a text-to-image dataset for studying compositional generation, comprising 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. Our model can also serve as a metric for evaluating text-image alignment in other models. All code and the dataset are available at this https URL.

Title: Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Authors: Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00127
Pdf URL: https://arxiv.org/pdf/2412.00127
Copy Paste: [[2412.00127]] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads(https://arxiv.org/abs/2412.00127)
Keywords: diffusion
Abstract: We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved contents. Unlike prior work on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus -- by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can further embrace post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines including Show-o and Chameleon across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting the potential for handling intricate practical generation tasks.

Title: PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Authors: ShuaiHeng Li, Qing Cai, Fan Zhang, Menghuan Zhang, Yangyang Shu, Zhi Liu, Huafeng Li, Lingqiao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00134
Pdf URL: https://arxiv.org/pdf/2412.00134
Copy Paste: [[2412.00134]] PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition(https://arxiv.org/abs/2412.00134)
Keywords: self-supervised
Abstract: Self-supervised learning is emerging in fine-grained visual recognition with promising results. However, existing self-supervised learning methods are often susceptible to irrelevant patterns in self-supervised tasks and lack the capability to represent the subtle differences inherent in fine-grained visual recognition (FGVR), resulting in generally poorer performance. To address this, we propose a novel Priority-Perception Self-Supervised Learning framework, denoted as PP-SSL, which can effectively filter out irrelevant feature interference and extract more subtle discriminative features throughout the training process. Specifically, it comprises two main parts: the Anti-Interference Strategy (AIS) and the Image-Aided Distinction Module (IADM). In AIS, a fine-grained textual description corpus is established, and a knowledge distillation strategy is devised to guide the model in eliminating irrelevant features while enhancing the learning of more discriminative and high-quality features. IADM shows that extracting GradCAM from the original image effectively reveals subtle differences between fine-grained categories. Compared to features extracted from intermediate or output layers, the original image retains more detail, allowing for a deeper exploration of the subtle distinctions among fine-grained classes. Extensive experimental results indicate that PP-SSL significantly outperforms existing methods across various datasets, highlighting its effectiveness in fine-grained recognition tasks. Our code will be made publicly available upon publication.

Title: FonTS: Text Rendering with Typography and Style Controls

Authors: Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, Xingxing Zou
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00136
Pdf URL: https://arxiv.org/pdf/2412.00136
Copy Paste: [[2412.00136]] FonTS: Text Rendering with Typography and Style Controls(https://arxiv.org/abs/2412.00136)
Keywords: diffusion
Abstract: Visual text images are prevalent in various applications, requiring careful font selection and typographic choices. Recent advances in Diffusion Transformer (DiT)-based text-to-image (T2I) models show promise in automating these processes. However, these methods still face challenges such as inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline to address these issues by enhancing controllability over typography and style in text rendering. We introduce Typography Control (TC) finetuning, an efficient parameter fine-tuning method, and enclosing typography control tokens (ETC-tokens), which enable precise word-level application of typographic features. To further enhance style control, we present a Style Control Adapter (SCA) that injects style information through image inputs independent of text prompts. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in Basic and Artistic Text Rendering (BTR and ATR) tasks. Our results mark a significant advancement in the precision and adaptability of T2I models, presenting new possibilities for creative applications and design-oriented tasks.

Title: Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

Authors: Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00142
Pdf URL: https://arxiv.org/pdf/2412.00142
Copy Paste: [[2412.00142]] Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers(https://arxiv.org/abs/2412.00142)
Keywords: generative
Abstract: Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.

Title: MPQ-Diff: Mixed Precision Quantization for Diffusion Models

Authors: Rocco Manz Maruzzelli, Basile Lewandowski, Lydia Y. Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00144
Pdf URL: https://arxiv.org/pdf/2412.00144
Copy Paste: [[2412.00144]] MPQ-Diff: Mixed Precision Quantization for Diffusion Models(https://arxiv.org/abs/2412.00144)
Keywords: diffusion
Abstract: Diffusion models (DMs) generate remarkably high-quality images via the stochastic denoising process, which unfortunately incurs high sampling time. Post-quantizing the trained diffusion models to fixed bit-widths, e.g., 4 bits for weights and 8 bits for activations, has been shown to be effective in reducing sampling time while maintaining image quality. Motivated by the observation that the cross-layer dependency of DMs varies across layers and sampling steps, we propose a mixed precision quantization scheme, MPQ-Diff, which allocates different bit-widths to the weights and activations of the layers. We advocate using the cross-layer correlation of a given layer, termed the network orthogonality metric, as a proxy to measure the relative importance of a layer per sampling step. We further adopt a uniform sampling scheme to avoid the excessive profiling overhead of estimating orthogonality across all time steps. We evaluate the proposed mixed-precision scheme on LSUN and ImageNet, showing a significant improvement in FID from 65.73 to 15.39, and 52.66 to 14.93, respectively, compared to fixed precision quantization.
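
To make the role of bit-width concrete, the sketch below applies plain uniform symmetric quantization to a handful of weights at 4 and 8 bits and compares the rounding error. It is a generic illustration, not MPQ-Diff's allocation scheme or its orthogonality metric. The helper names and the weight values are hypothetical.

```python
def quantize(values, bits):
    """Uniform symmetric quantization to signed `bits`-bit levels, then dequantization."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        return list(values)  # all-zero input needs no quantization
    return [round(v / scale) * scale for v in values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.91, -0.42, 0.07, -0.88, 0.33, 0.15]  # hypothetical layer weights
err4 = mean_abs_error(weights, quantize(weights, 4))
err8 = mean_abs_error(weights, quantize(weights, 8))
print(f"4-bit error: {err4:.4f}, 8-bit error: {err8:.4f}")
```

A mixed-precision scheme spends the larger bit-widths only on the layers where this kind of error matters most, which is the trade-off the abstract describes.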

Title: Knowledge-Augmented Explainable and Interpretable Learning for Anomaly Detection and Diagnosis

Authors: Martin Atzmueller, Tim Bohne, Patricia Windler
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00146
Pdf URL: https://arxiv.org/pdf/2412.00146
Copy Paste: [[2412.00146]] Knowledge-Augmented Explainable and Interpretable Learning for Anomaly Detection and Diagnosis(https://arxiv.org/abs/2412.00146)
Keywords: anomaly
Abstract: Knowledge-augmented learning enables the combination of knowledge-based and data-driven approaches. For anomaly detection and diagnosis, understandability is typically an important factor, especially in high-risk areas. Therefore, explainability and interpretability are also major criteria in such contexts. This chapter focuses on knowledge-augmented explainable and interpretable learning to enhance understandability, transparency and ultimately computational sensemaking. We exemplify different approaches and methods in the domains of anomaly detection and diagnosis - from comparatively simple interpretable methods towards more advanced neuro-symbolic approaches.

Title: Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

Authors: Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00150
Pdf URL: https://arxiv.org/pdf/2412.00150
Copy Paste: [[2412.00150]] Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise(https://arxiv.org/abs/2412.00150)
Keywords: foundation model
Abstract: Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, thus robustly classifying the training samples. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMNIST, and OrganCMNIST datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.

Title: VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Authors: Taesung Kwon, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00156
Pdf URL: https://arxiv.org/pdf/2412.00156
Copy Paste: [[2412.00156]] VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models(https://arxiv.org/abs/2412.00156)
Keywords: diffusion
Abstract: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present batch-consistent inversion, an initialization technique that incorporates informative latents from the measurement frame. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a single NVIDIA 4090 GPU. Project page: this https URL.

Title: AerialGo: Walking-through City View Generation from Aerial Perspectives

Authors: Fuqiang Zhao, Yijing Guo, Siyuan Yang, Xi Chen, Luo Wang, Lan Xu, Yingliang Zhang, Yujiao Shi, Jingyi Yu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00157
Pdf URL: https://arxiv.org/pdf/2412.00157
Copy Paste: [[2412.00157]] AerialGo: Walking-through City View Generation from Aerial Perspectives(https://arxiv.org/abs/2412.00157)
Keywords: diffusion, generative
Abstract: High-quality 3D urban reconstruction is essential for applications in urban planning, navigation, and AR/VR. However, capturing detailed ground-level data across cities is both labor-intensive and raises significant privacy concerns related to sensitive information, such as vehicle plates, faces, and other personal identifiers. To address these challenges, we propose AerialGo, a novel framework that generates realistic walking-through city views from aerial images, leveraging multi-view diffusion models to achieve scalable, photorealistic urban reconstructions without direct ground-level data collection. By conditioning ground-view synthesis on accessible aerial data, AerialGo bypasses the privacy risks inherent in ground-level imagery. To support model training, we introduce the AerialGo dataset, a large-scale dataset containing diverse aerial and ground-view images, paired with camera and depth information, designed to support generative urban reconstruction. Experiments show that AerialGo significantly enhances ground-level realism and structural coherence, providing a privacy-conscious, scalable solution for city-scale 3D modeling.

Title: Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning

Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Subjects: cs.CV, cs.LG, cs.SD, eess.AS, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00175
Pdf URL: https://arxiv.org/pdf/2412.00175
Copy Paste: [[2412.00175]] Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning(https://arxiv.org/abs/2412.00175)
Keywords: self-supervised
Abstract: Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection - the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto this unwanted artifact, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
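
Illustrative sketch (not from the paper): the abstract above says that leading silence alone nearly separates real from fake samples. The snippet below shows one way such a shortcut feature could be measured and removed on a raw waveform. The function names, the amplitude threshold, and the 20 ms cutoff are all hypothetical choices made for illustration; the authors' actual procedure may differ.

```python
import numpy as np

def leading_silence_seconds(waveform, sample_rate, amp_threshold=1e-3):
    """Time in seconds before the first sample whose magnitude exceeds amp_threshold."""
    loud = np.flatnonzero(np.abs(waveform) > amp_threshold)
    first = loud[0] if loud.size else len(waveform)  # all-silent clip: whole length
    return first / sample_rate

def has_leading_silence(waveform, sample_rate, min_silence=0.02):
    """Hypothetical shortcut 'detector': flags clips that open with >= min_silence seconds of silence."""
    return leading_silence_seconds(waveform, sample_rate) >= min_silence

def trim_leading_silence(waveform, amp_threshold=1e-3):
    """Drop the silent prefix so a model cannot rely on it."""
    loud = np.flatnonzero(np.abs(waveform) > amp_threshold)
    return waveform[loud[0]:] if loud.size else waveform[:0]

# Synthetic data (assumed, not from any dataset): a 440 Hz tone, with and without 50 ms of silence.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
padded = np.concatenate([np.zeros(800), tone])

print(has_leading_silence(tone, sr), has_leading_silence(padded, sr))
print(has_leading_silence(trim_leading_silence(padded), sr))
```

Once the prefix is trimmed, the two clips become indistinguishable to this feature. That is the kind of check the abstract describes when it reports that models perform worse after the leading silence is removed.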

Title: Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Authors: Hui Ren, Joanna Materzynska, Rohit Gandikota, David Bau, Antonio Torralba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00176
Pdf URL: https://arxiv.org/pdf/2412.00176
Copy Paste: [[2412.00176]] Art-Free Generative Models: Art Creation Without Graphic Art Knowledge(https://arxiv.org/abs/2412.00176)
Keywords: generative
Abstract: We explore the question: "How much prior art knowledge is needed to create art?" To investigate this, we propose a text-to-image generation model trained without access to art-related content. We then introduce a simple yet effective method to learn an art adapter using only a few examples of selected artistic styles. Our experiments show that art generated using our method is perceived by users as comparable to art produced by models trained on large, art-rich datasets. Finally, through data attribution techniques, we illustrate how examples from both artistic and non-artistic datasets contributed to the creation of new artistic styles.

Title: LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Authors: Xiaoyan Xing, Konrad Groh, Sezer Karagolu, Theo Gevers, Anand Bhattad
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00177
Pdf URL: https://arxiv.org/pdf/2412.00177
Copy Paste: [[2412.00177]] LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting(https://arxiv.org/abs/2412.00177)
Keywords: diffusion, generative
Abstract: We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy from the StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.

Title: Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation

Authors: Michele De Vita, Vasileios Belagiannis
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00205
Pdf URL: https://arxiv.org/pdf/2412.00205
Copy Paste: [[2412.00205]] Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation(https://arxiv.org/abs/2412.00205)
Keywords: diffusion, generative
Abstract: Despite the remarkable progress in generative modelling, current diffusion models lack a quantitative approach to assess image quality. To address this limitation, we propose to estimate the pixel-wise aleatoric uncertainty during the sampling phase of diffusion models and utilise the uncertainty to improve the sample generation quality. The uncertainty is computed as the variance of the denoising scores with a perturbation scheme that is specifically designed for diffusion models. We then show that the aleatoric uncertainty estimates are related to the second-order derivative of the diffusion noise distribution. We evaluate our uncertainty estimation algorithm and the uncertainty-guided sampling on the ImageNet and CIFAR-10 datasets. In our comparisons with the related work, we demonstrate promising results in filtering out low quality samples. Furthermore, we show that our guided approach leads to better sample generation in terms of FID scores.
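
Illustrative sketch (not from the paper): the abstract above describes the uncertainty as the variance of the denoising scores under a perturbation scheme. A generic version of that idea can be written as follows, assuming plain Gaussian perturbations of the noisy input and a stand-in score function. The paper's perturbation scheme is designed specifically for diffusion models and is not reproduced here; `score_fn`, `sigma`, and `num_perturbations` are hypothetical names and values.

```python
import numpy as np

def pixelwise_uncertainty(score_fn, x_t, num_perturbations=8, sigma=0.05, seed=0):
    """Per-pixel variance of score_fn outputs over randomly perturbed copies of x_t."""
    rng = np.random.default_rng(seed)
    scores = np.stack([
        score_fn(x_t + sigma * rng.standard_normal(x_t.shape))
        for _ in range(num_perturbations)
    ])
    return scores.var(axis=0)  # same shape as x_t, one variance per pixel

def toy_score(x):
    """Stand-in for a trained network: the score of a standard normal, grad log p(x) = -x."""
    return -x

x_t = np.zeros((4, 4))
u_map = pixelwise_uncertainty(toy_score, x_t)
print(u_map.shape, float(u_map.mean()))  # a scalar summary could be used to rank samples
```

A per-sample summary such as the mean of the map is one plausible way to filter out low-quality samples, though the abstract does not state which aggregation the authors use.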

Title: Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Authors: Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian Price, Scott Cohen, Jianming Zhang, Daniel Aliaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00306
Pdf URL: https://arxiv.org/pdf/2412.00306
Copy Paste: [[2412.00306]] Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment(https://arxiv.org/abs/2412.00306)
Keywords: diffusion, generative
Abstract: Personalized image generation has emerged from the recent advancements in generative models. However, these generated personalized images often suffer from localized artifacts such as incorrect logos, reducing fidelity and fine-grained identity details of the generated results. Furthermore, there is little prior work tackling this problem. To help improve these identity details in the personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that employs a diffusion-based framework to address this challenge. Our model consists of two stages: Alignment Stage and Refinement Stage, which share weights of a unified neural network model. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which are then used by the refinement stage to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, generalizing well to existing models on various tasks including but not limited to customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine details in the image synthesis models.

Title: Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach

Authors: Feiyang Liu, Dan Guo, Jingyuan Xu, Zihao He, Shengeng Tang, Kun Li, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00309
Pdf URL: https://arxiv.org/pdf/2412.00309
Copy Paste: [[2412.00309]] Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach(https://arxiv.org/abs/2412.00309)
Keywords: foundation model
Abstract: Following the gaze of other people and analyzing the target they are looking at can help us understand what they are thinking and doing, and predict the actions that may follow. Existing methods for gaze following struggle to perform well in natural scenes with diverse objects, and focus on gaze points rather than objects, making it difficult to deliver clear semantics and accurate scope of the targets. To address this shortcoming, we propose a novel gaze target prediction solution named GazeSeg, that can fully utilize the spatial visual field of the person as guiding information and lead to a progressively coarse-to-fine gaze target segmentation and recognition process. Specifically, a prompt-based visual foundation model serves as the encoder, working in conjunction with three distinct decoding modules (e.g. FoV perception, heatmap generation, and segmentation) to form the framework for gaze target prediction. Then, with the head bounding box serving as an initial prompt, GazeSeg obtains the FoV map, heatmap, and segmentation map progressively, leading to a unified framework for multiple tasks (e.g. direction estimation, gaze target segmentation, and recognition). In particular, to facilitate this research, we construct and release a new dataset, comprising 72k images with pixel-level annotations and 270 categories of gaze targets, built upon the GazeFollow dataset. The quantitative evaluation shows that our approach achieves a Dice score of 0.325 in gaze target segmentation and 71.7% top-5 recognition. Meanwhile, our approach also outperforms previous state-of-the-art methods, achieving 0.953 in AUC on the gaze-following task. The dataset and code will be released.

Title: Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey

Authors: Wei Zhou, Lei Zhao, Runyu Zhang, Yifan Cui, Hongpu Huang, Kun Qie, Chen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00348
Pdf URL: https://arxiv.org/pdf/2412.00348
Copy Paste: [[2412.00348]] Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey(https://arxiv.org/abs/2412.00348)
Keywords: foundation model, anomaly
Abstract: Traffic Surveillance Systems (TSS) have become increasingly crucial in modern intelligent transportation systems, with vision-based technologies playing a central role for scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis bridging low-level and high-level perception tasks, particularly considering emerging technologies, remains lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, demonstrating their unique capabilities in zero-shot learning, semantic understanding, and scene generation. This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.

Title: DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation

Authors: Zhaoxing Gan, Guangnan Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00381
Pdf URL: https://arxiv.org/pdf/2412.00381
Copy Paste: [[2412.00381]] DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation(https://arxiv.org/abs/2412.00381)
Keywords: diffusion, generative
Abstract: Layout Generation aims to synthesize plausible arrangements from given elements. Currently, the predominant methods in layout generation are Generative Adversarial Networks (GANs) and diffusion models, each presenting its own set of challenges. GANs typically struggle with handling discrete data due to their requirement for differentiable generated samples and have historically circumvented the direct generation of discrete labels by treating them as fixed conditions. Conversely, diffusion-based models, despite achieving state-of-the-art performance across several metrics, require extensive sampling steps which lead to significant time costs. To address these limitations, we propose DogLayout (Denoising Diffusion GAN Layout model), which integrates a diffusion process into GANs to enable the generation of discrete label data and significantly reduce diffusion's sampling time. Experiments demonstrate that DogLayout considerably reduces sampling costs by up to 175 times and cuts overlap from 16.43 to 9.59 compared to existing diffusion models, while also surpassing GAN based and other layout methods. Code is available at this https URL.

Title: On Foundation Models for Dynamical Systems from Purely Synthetic Data

Authors: Martin Ziegler, Andres Felipe Posada-Moreno, Friedrich Solowjow, Sebastian Trimpe
Subjects: cs.LG, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00395
Pdf URL: https://arxiv.org/pdf/2412.00395
Copy Paste: [[2412.00395]] On Foundation Models for Dynamical Systems from Purely Synthetic Data(https://arxiv.org/abs/2412.00395)
Keywords: foundation model
Abstract: Foundation models have demonstrated remarkable generalization, data efficiency, and robustness properties across various domains. In this paper, we explore the feasibility of foundation models for applications in the control domain. The success of these models is enabled by large-scale pretraining on Internet-scale datasets. These are available in fields like natural language processing and computer vision, but do not exist for dynamical systems. We address this challenge by pretraining a transformer-based foundation model exclusively on synthetic data and propose to sample dynamics functions from a reproducing kernel Hilbert space. Our pretrained model generalizes for prediction tasks across different dynamical systems, which we validate in simulation and hardware experiments, including cart-pole and Furuta pendulum setups. Additionally, the model can be fine-tuned effectively to new systems to increase performance even further. Our results demonstrate the feasibility of foundation models for dynamical systems that outperform specialist models in terms of generalization, data efficiency, and robustness.

Title: DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Authors: Yatian Pang, Bin Zhu, Bin Lin, Mingzhe Zheng, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00397
Pdf URL: https://arxiv.org/pdf/2412.00397
Copy Paste: [[2412.00397]] DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses(https://arxiv.org/abs/2412.00397)
Keywords: diffusion
Abstract: In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. Existing approaches struggle with generating coherent, high-quality content in an efficient and user-friendly manner. Concretely, baseline methods relying on only 2D pose guidance lack the cues of 3D information, leading to suboptimal results, while methods using 3D representation as guidance achieve higher quality but involve a cumbersome and time-intensive process. To address these limitations, DreamDance enriches 3D geometry cues from 2D poses by introducing an efficient diffusion model, enabling high-quality human image animation with various guidance. Our key insight is that human images naturally exhibit multiple levels of correlation, progressing from coarse skeleton poses to fine-grained geometry cues, and further from these geometry cues to explicit appearance details. Capturing such correlations could enrich the guidance signals, facilitating intra-frame coherency and inter-frame consistency. Specifically, we construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations, including human pose, depth, and normal maps. Next, we introduce a Mutually Aligned Geometry Diffusion Model to generate fine-grained depth and normal maps for enriched guidance. Finally, a Cross-domain Controller incorporates multi-level guidance to animate human images effectively with a video diffusion model. Extensive experiments demonstrate that our method achieves state-of-the-art performance in animating human images.

Title: FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai, Yi-Lun Wu, Hong-Han Shuai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00427
Pdf URL: https://arxiv.org/pdf/2412.00427
Copy Paste: [[2412.00427]] FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting(https://arxiv.org/abs/2412.00427)
Keywords: diffusion
Abstract: In this study, we aim to determine and solve the deficiency of Stable Diffusion Inpainting (SDI) in following the instruction of both prompt and mask. Due to the training bias from masking, the inpainting quality is hindered when the prompt instruction and image condition are not related. Therefore, we conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layer. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency of the image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding up to 60% and 58% improvements in CLIP score for SDI and SDXLI, respectively.

Title: A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge

Authors: Atharva Deshpande, Kaushik Gopalan, Jeet Shah, Hrishikesh Simu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00451
Pdf URL: https://arxiv.org/pdf/2412.00451
Copy Paste: [[2412.00451]] A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge(https://arxiv.org/abs/2412.00451)
Keywords: generative
Abstract: This study explores the application of deep learning for rainfall prediction, leveraging the Spinning Enhanced Visible and Infrared Imager (SEVIRI) High rate information transmission (HRIT) data as input and the Operational Program on the Exchange of weather RAdar information (OPERA) ground-radar reflectivity data as ground truth. We use the mean of 4 InfraRed frequency channels as the input. The radiance images are forecasted up to 4 hours into the future using a dense optical flow algorithm. A conditional generative adversarial network (GAN) model is employed to transform the predicted radiance images into rainfall images which are aggregated over the 4 hour forecast period to generate cumulative rainfall values. This model scored a value of approximately 7.5 as the Continuous Ranked Probability Score (CRPS) in the Weather4Cast 2024 competition and placed 1st on the core challenge leaderboard.

Title: Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion

Authors: Jona Ballé, Luca Versari, Emilien Dupont, Hyunjik Kim, Matthias Bauer
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00505
Pdf URL: https://arxiv.org/pdf/2412.00505
Copy Paste: [[2412.00505]] Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion(https://arxiv.org/abs/2412.00505)
Keywords: generative
Abstract: Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.

Title: Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Authors: Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2412.00508
Pdf URL: https://arxiv.org/pdf/2412.00508
Copy Paste: [[2412.00508]] Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence(https://arxiv.org/abs/2412.00508)
Keywords: generative
Abstract: Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes but their effectiveness on industry relevant case studies still needs to be investigated.
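
Illustrative sketch (not from the paper): the abstract above reports top-5 accuracy for a model that returns sequences. The snippet shows how such a metric is commonly computed, assuming each prediction is a ranked list of candidate strings and a hit means an exact string match with the reference. The matching criterion and the placeholder strings are assumptions; they are not real SFILES 2.0 sequences, and the authors' evaluation may differ.

```python
def top_k_accuracy(ranked_candidates, targets, k=5):
    """Fraction of examples whose target appears among the first k ranked candidates."""
    if len(ranked_candidates) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_candidates, targets))
    return hits / len(targets)

# Placeholder strings only (hypothetical); each inner list is ordered best-first.
predictions = [
    ["seq_a", "seq_b", "seq_c"],   # target found at rank 2
    ["seq_d", "seq_e"],            # target not in the list
]
references = ["seq_b", "seq_x"]

print(top_k_accuracy(predictions, references, k=5))  # 0.5
print(top_k_accuracy(predictions, references, k=1))  # 0.0
```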

Title: Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects

Authors: Amir Barda, Matheus Gadelha, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, Thibault Groueix
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00518
Pdf URL: https://arxiv.org/pdf/2412.00518
Copy Paste: [[2412.00518]] Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects(https://arxiv.org/abs/2412.00518)
Keywords: diffusion, generative
Abstract: We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in approximately 3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor in training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.

Title: Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

Authors: Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00557
Pdf URL: https://arxiv.org/pdf/2412.00557
Copy Paste: [[2412.00557]] Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion(https://arxiv.org/abs/2412.00557)
Keywords: diffusion
Abstract: Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.

Title: Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection

Authors: Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00560
Pdf URL: https://arxiv.org/pdf/2412.00560
Copy Paste: [[2412.00560]] Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection(https://arxiv.org/abs/2412.00560)
Keywords: anomaly
Abstract: Overfitting has long been stigmatized as detrimental to model performance, especially in the context of anomaly detection. Our work challenges this conventional view by introducing a paradigm shift, recasting overfitting as a controllable and strategic mechanism for enhancing model discrimination capabilities. In this paper, we present Controllable Overfitting-based Anomaly Detection (COAD), a novel framework designed to leverage overfitting for optimized anomaly detection. We propose the Aberrance Retention Quotient (ARQ), a novel metric that systematically quantifies the extent of overfitting, enabling the identification of an optimal "golden overfitting interval." Within this interval, overfitting is leveraged to significantly amplify the model's sensitivity to anomalous patterns, while preserving generalization to normal samples. Additionally, we present the Relative Anomaly Distribution Index (RADI), an innovative metric designed to complement AUROC pixel by providing a more versatile and theoretically robust framework for assessing model performance. RADI leverages ARQ to track and evaluate how overfitting impacts anomaly detection, offering an integrated approach to understanding the relationship between overfitting dynamics and model efficacy. Our theoretical work also rigorously validates the use of Gaussian noise in pseudo anomaly synthesis, providing the foundation for its broader applicability across diverse domains. Empirical evaluations demonstrate that our controllable overfitting method not only achieves State of the Art (SOTA) performance in both one-class and multi-class anomaly detection tasks but also redefines overfitting from a modeling challenge into a powerful tool for optimizing anomaly detection.

Title: Continuous Concepts Removal in Text-to-image Diffusion Models

Authors: Tingxu Han, Weisong Sun, Yanrong Hu, Chunrong Fang, Yonglong Zhang, Shiqing Ma, Tao Zheng, Zhenyu Chen, Zhenting Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00580
Pdf URL: https://arxiv.org/pdf/2412.00580
Copy Paste: [[2412.00580]] Continuous Concepts Removal in Text-to-image Diffusion Models(https://arxiv.org/abs/2412.00580)
Keywords: diffusion
Abstract: Text-to-image diffusion models have shown an impressive ability to generate high-quality images from input textual descriptions. However, concerns have been raised about the potential for these models to create content that infringes on copyrights or depicts disturbing subject matter. Removing specific concepts from these models is a promising potential solution to this problem. However, existing methods for concept removal do not work well in practical but challenging scenarios where concepts need to be continuously removed. Specifically, these methods lead to poor alignment between the text prompts and the generated image after the continuous removal process. To address this issue, we propose a novel approach called CCRT that includes a designed knowledge distillation paradigm. It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts generated through our genetic algorithm, which employs a designed fuzzing strategy. We conduct extensive experiments involving the removal of various concepts. The results evaluated through both algorithmic metrics and human studies demonstrate that our CCRT can effectively remove the targeted concepts in a continuous manner while maintaining the high generation quality (e.g., text-image alignment) of the model.

Title: Generative LiDAR Editing with Controllable Novel Object Layouts

Authors: Shing-Hei Ho, Bao Thach, Minghan Zhu
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00592
Pdf URL: https://arxiv.org/pdf/2412.00592
Copy Paste: [[2412.00592]] Generative LiDAR Editing with Controllable Novel Object Layouts(https://arxiv.org/abs/2412.00592)
Keywords: generative
Abstract: We propose a framework to edit real-world Lidar scans with novel object layouts while preserving a realistic background environment. Compared to the synthetic data generation frameworks where Lidar point clouds are generated from scratch, our framework focuses on new scenario generation in a given background environment, and our method also provides labels for the generated data. This approach ensures the generated data remains relevant to the specific environment, aiding both the development and the evaluation of algorithms in real-world scenarios. Compared with novel view synthesis, our framework allows the creation of counterfactual scenarios with significant changes in the object layout and does not rely on multi-frame optimization. In our framework, the object removal and insertion are supported by generative background inpainting and object point cloud completion, and the entire pipeline is built upon spherical voxelization, which realizes the correct Lidar projective geometry by construction. Experiments show that our framework generates realistic Lidar scans with object layout changes and benefits the development of Lidar-based self-driving systems.

Title: PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Authors: Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00596
Pdf URL: https://arxiv.org/pdf/2412.00596
Copy Paste: [[2412.00596]] PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation(https://arxiv.org/abs/2412.00596)
Keywords: diffusion
Abstract: Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, and do not generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers. The source code is available at: this https URL.

Title: A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

Authors: Chensheng Peng, Ido Sobol, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu, Or Litany
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00623
Pdf URL: https://arxiv.org/pdf/2412.00623
Copy Paste: [[2412.00623]] A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision(https://arxiv.org/abs/2412.00623)
Keywords: diffusion, generative
Abstract: We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available. Experimental results on object-level and scene-level datasets demonstrate the effectiveness of our framework.

Title: Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis

Authors: Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00638
Pdf URL: https://arxiv.org/pdf/2412.00638
Copy Paste: [[2412.00638]] Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis(https://arxiv.org/abs/2412.00638)
Keywords: diffusion
Abstract: Designing stylized cinemagraphs is challenging due to the difficulty in customizing complex and expressive flow motions. To achieve intuitive and detailed control of the generated cinemagraphs, freehand sketches can convey personalized design requirements better than text inputs alone. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. A latent diffusion model is adopted to generate target stylized landscape images along with realistic versions. Then, a pre-trained object detection model is utilized to segment and obtain masks for the flow regions. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches serve as the conditions to control the generated vector fields in the masked fluid regions with the prompt. To synthesize the cinemagraph frames, the pixels within fluid regions are subsequently warped to the target locations for each timestep using a frame generator. The results verify that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against the state-of-the-art generation approaches.

Title: Improving Decoupled Posterior Sampling for Inverse Problems using Data Consistency Constraint

Authors: Zhi Qi, Shihong Yuan, Yuyin Yuan, Linling Kuang, Yoshiyuki Kabashima, Xiangming Meng
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00664
Pdf URL: https://arxiv.org/pdf/2412.00664
Copy Paste: [[2412.00664]] Improving Decoupled Posterior Sampling for Inverse Problems using Data Consistency Constraint(https://arxiv.org/abs/2412.00664)
Keywords: diffusion
Abstract: Diffusion models have shown strong performance in solving inverse problems through posterior sampling, but they suffer from errors during earlier steps. To mitigate this issue, several Decoupled Posterior Sampling methods have been recently proposed. However, the reverse process in these methods ignores measurement information, leading to errors that impede effective optimization in subsequent steps. To solve this problem, we propose Guided Decoupled Posterior Sampling (GDPS) by integrating a data consistency constraint in the reverse process. The constraint yields a smoother transition within the optimization process, facilitating a more effective convergence toward the target distribution. Furthermore, we extend our method to latent diffusion models and Tweedie's formula, demonstrating its scalability. We evaluate GDPS on the FFHQ and ImageNet datasets across various linear and nonlinear tasks under both standard and challenging conditions. Experimental results demonstrate that GDPS achieves state-of-the-art performance, improving accuracy over existing methods.

Title: Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection

Authors: Yingjian Chen, Lei Zhang, Yakun Niu, Lei Tan, Pei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00665
Pdf URL: https://arxiv.org/pdf/2412.00665
Copy Paste: [[2412.00665]] Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection(https://arxiv.org/abs/2412.00665)
Keywords: diffusion
Abstract: Diffusion Models enable realistic image generation, raising the risk of misinformation and eroding public trust. Currently, detecting images generated by unseen diffusion models remains challenging due to the limited generalization capabilities of existing methods. To address this issue, we rethink the effectiveness of pre-trained models trained on large-scale, real-world images. Our findings indicate that: 1) Pre-trained models can cluster the features of real images effectively. 2) Models with pre-trained weights can approximate an optimal generalization solution at a specific training step, but it is extremely unstable. Based on these facts, we propose a simple yet effective training method called Learning on Less (LoL). LoL utilizes a random masking mechanism to constrain the model's learning of the unique patterns specific to a certain type of diffusion model, allowing it to focus on less image content. This leverages the inherent strengths of pre-trained weights while enabling a more stable approach to optimal generalization, which results in the extraction of a universal feature that differentiates various diffusion-generated images from real images. Extensive experiments on the GenImage benchmark demonstrate the remarkable generalization capability of our proposed LoL. With just 1% training data, LoL significantly outperforms the current state-of-the-art, achieving a 13.6% improvement in average ACC across images generated by eight different models.
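
The random masking idea described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the masking scheme, so everything here is an assumption made for illustration: the function name `random_patch_mask`, the patch-wise granularity, the `patch` size, the `mask_ratio`, and filling masked pixels with zeros are all hypothetical and are not taken from the paper.

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, rng=None):
    """Zero out a random subset of non-overlapping patches.

    Hypothetical illustration only: the patch size, ratio, and zero fill
    are assumptions, not the scheme used by the LoL authors.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch          # patch grid dimensions
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Choose which patches to hide, without replacement.
    masked_ids = rng.choice(n_patches, size=n_masked, replace=False)
    keep = np.ones(n_patches, dtype=bool)
    keep[masked_ids] = False
    keep_grid = keep.reshape(gh, gw)

    # Expand the patch-level mask to pixel resolution.
    keep_px = np.repeat(np.repeat(keep_grid, patch, axis=0), patch, axis=1)

    out = image.copy()                        # leave the input untouched
    region = out[:gh * patch, :gw * patch]    # view into `out`
    region[~keep_px] = 0                      # hide the masked patches
    return out, keep_grid

# Usage: mask 12 of the 16 patches of a 64x64 RGB image.
img = np.ones((64, 64, 3), dtype=np.float32)
masked, grid = random_patch_mask(img, rng=np.random.default_rng(0))
```

A detector trained on `masked` instead of `img` sees only a fraction of each image, which is one plausible way to limit how much generator-specific detail it can memorize.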

Title: FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

Authors: Yunpeng Bai, Qixing Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00671
Pdf URL: https://arxiv.org/pdf/2412.00671
Copy Paste: [[2412.00671]] FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation(https://arxiv.org/abs/2412.00671)
Keywords: diffusion, generative
Abstract: Monocular Depth Estimation (MDE) is essential for applications like 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust MDE remains challenging due to noisy real-world data and distribution gaps in synthetic datasets. Existing methods often struggle with low efficiency, reduced accuracy, and lack of detail. To address this, we propose an efficient approach for leveraging diffusion priors and introduce FiffDepth, a framework that transforms diffusion-based image generators into a feedforward architecture for detailed depth estimation. By preserving key generative features and integrating the strong generalization capabilities of models like DINOv2, FiffDepth achieves enhanced accuracy, stability, and fine-grained detail, offering a significant improvement in MDE performance across diverse real-world scenarios.

Title: Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

Authors: Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00684
Pdf URL: https://arxiv.org/pdf/2412.00684
Copy Paste: [[2412.00684]] Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding(https://arxiv.org/abs/2412.00684)
Keywords: generative
Abstract: Visual grounding aims to localize the image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to effectively learn visual grounding under data-scarce settings in this paper. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box, then Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to identify the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Experimental results show that POBF achieves superior performance across four datasets, delivering an average improvement of 5.83% and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, data ratios, and model architectures.

Title: Enhancing the Generalization Capability of Skin Lesion Classification Models with Active Domain Adaptation Methods

Authors: Jun Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00702
Pdf URL: https://arxiv.org/pdf/2412.00702
Copy Paste: [[2412.00702]] Enhancing the Generalization Capability of Skin Lesion Classification Models with Active Domain Adaptation Methods(https://arxiv.org/abs/2412.00702)
Keywords: self-supervised
Abstract: We propose a method to improve the generalization ability of skin lesion classification models by combining self-supervised learning (SSL), unsupervised domain adaptation (UDA), and active domain adaptation (ADA). The main steps of the approach include selection of an SSL pretrained model on natural image datasets, subsequent SSL retraining on all available skin lesion datasets, finetuning of the model on source domain data with labels, application of UDA methods on target domain data, and lastly, implementation of ADA methods. The efficacy of the proposed approach is assessed across ten skin lesion datasets from different domains, demonstrating its potential for enhancing the performance of skin lesion classification models. This approach holds promise for facilitating the widespread adoption of medical imaging models in clinical settings, thereby amplifying their impact.

Title: Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Authors: Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00733
Pdf URL: https://arxiv.org/pdf/2412.00733
Copy Paste: [[2412.00733]] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks(https://arxiv.org/abs/2412.00733)
Keywords: diffusion, generative
Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: this https URL.

Title: CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images

Authors: Jian Liu, Zhen Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00754
Pdf URL: https://arxiv.org/pdf/2412.00754
Copy Paste: [[2412.00754]] CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images(https://arxiv.org/abs/2412.00754)
Keywords: generative
Abstract: The neural radiance field (NeRF) advocates learning the continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this into a generative model, the generative neural radiance field (GRAF) is capable of producing images from random noise z without 3D supervision. In practice, the shape and appearance are modeled by z_s and z_a, respectively, to manipulate them separately during inference. However, it is challenging to represent multiple scenes using a single MLP and precisely control the generation of 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model (i.e., CtrlNeRF) that uses a single MLP network to represent multiple scenes with shared weights. Consequently, we manipulate the shape and appearance codes to realize the controllable generation of high-fidelity images with 3D consistency. Moreover, the model enables the synthesis of novel views that do not exist in the training sets via camera pose alteration and feature interpolation. Extensive experiments were conducted to demonstrate its superiority in 3D-aware image generation compared to its counterparts.

Title: DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling

Authors: Xin Xie, Dong Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00759
Pdf URL: https://arxiv.org/pdf/2412.00759
Copy Paste: [[2412.00759]] DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling(https://arxiv.org/abs/2412.00759)
Keywords: diffusion
Abstract: Text-to-image diffusion model alignment is critical for improving the alignment between the generated images and human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose a plug-and-play training-free alignment method, DyMO, for aligning the generated images and human preferences during inference. Apart from text-aware human preference scores, we introduce a semantic alignment objective for enhancing the semantic alignment in the early stages of diffusion, relying on the fact that the attention maps are effective reflections of the semantics in noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.

Title: Learning to Forget using Hypernetworks

Authors: Jose Miguel Lara Rangel, Stefan Schoepf, Jack Foster, David Krueger, Usman Anwar
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.00761
Pdf URL: https://arxiv.org/pdf/2412.00761
Copy Paste: [[2412.00761]] Learning to Forget using Hypernetworks(https://arxiv.org/abs/2412.00761)
Keywords: diffusion
Abstract: Machine unlearning is gaining increasing attention as a way to remove adversarial data poisoning attacks from already trained models and to comply with privacy and AI regulations. The objective is to unlearn the effect of undesired data from a trained model while maintaining performance on the remaining data. This paper introduces HyperForget, a novel machine unlearning framework that leverages hypernetworks (neural networks that generate parameters for other networks) to dynamically sample models that lack knowledge of targeted data while preserving essential capabilities. Leveraging diffusion models, we implement two Diffusion HyperForget Networks and use them to sample unlearned models in proof-of-concept experiments. The unlearned models obtained zero accuracy on the forget set, while preserving good accuracy on the retain sets, highlighting the potential of HyperForget for dynamic targeted data removal and a promising direction for developing adaptive machine unlearning algorithms.

Title: PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis

Authors: Hao Dong, Wei Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00763
Pdf URL: https://arxiv.org/pdf/2412.00763
Copy Paste: [[2412.00763]] PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis(https://arxiv.org/abs/2412.00763)
Keywords: generative
Abstract: Recently, generative pre-training based models have demonstrated remarkable results on the Aspect-based Sentiment Analysis (ABSA) task. However, previous works overemphasize crafting various templates to paraphrase training targets for enhanced decoding, ignoring the internal optimizations on generative models. Despite notable results achieved by these target-oriented optimization methods, they struggle with complicated long texts, since implicit long-distance relations (e.g., the aspect-opinion relation) are difficult to extract under the position embedding mechanism in generative models. Thus, in this paper, we first clarify the causes of the problem and introduce two sequence optimization strategies: the rule-based static optimization and the score-based dynamic optimization. The rule-based approach relies on handcrafted priorities of dependency relations to reorder the context, while the score-based algorithm dynamically regulates the contextual sequence by calculating word position scores using a neural network. Based on the dynamic optimization structure, we further propose a unified Prompt-based Generative Sequence Optimization network (named PGSO), which jointly optimizes the training target as well as the generative model. Specifically, PGSO contains two components, namely, prompt construction and sequence regulator. The former constructs a task-specific prompt based on unsupervised training objects to fully utilize the pre-trained model. The latter jointly leverages semantic, syntactic and original-sequence information to dynamically regulate the contextual sequence. Our experiments conducted on four ABSA tasks across multiple benchmarks indicate that PGSO outperforms state-of-the-art methods, with an average improvement of 3.52% in F1 score.

Title: Explorations in Self-Supervised Learning: Dataset Composition Testing for Object Classification

Authors: Raynor Kirkson E. Chavez, Kyle Gabriel M. Reynoso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00770
Pdf URL: https://arxiv.org/pdf/2412.00770
Copy Paste: [[2412.00770]] Explorations in Self-Supervised Learning: Dataset Composition Testing for Object Classification(https://arxiv.org/abs/2412.00770)
Keywords: self-supervised
Abstract: This paper investigates the impact of sampling and pretraining using datasets with different image characteristics on the performance of self-supervised learning (SSL) models for object classification. To do this, we sample two apartment datasets from the Omnidata platform based on modality, luminosity, image size, and camera field of view and use them to pretrain a SimCLR model. The encodings generated from the pretrained model are then transferred to a supervised ResNet-50 model for object classification. Through A/B testing, we find that depth-pretrained models are more effective on low-resolution images, while RGB-pretrained models perform better on higher-resolution images. We also discover that increasing the luminosity of training images can improve the performance of models on low-resolution images without negatively affecting their performance on higher-resolution images.

Title: DIVD: Deblurring with Improved Video Diffusion Model

Authors: Haoyang Long, Yan Wang, Wendong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00773
Pdf URL: https://arxiv.org/pdf/2412.00773
Copy Paste: [[2412.00773]] DIVD: Deblurring with Improved Video Diffusion Model(https://arxiv.org/abs/2412.00773)
Keywords: diffusion
Abstract: Video deblurring presents a considerable challenge owing to the complexity of blur, which frequently results from a combination of camera shake and object motion. In the field of video deblurring, many previous works have primarily concentrated on distortion-based metrics, such as PSNR. However, this approach often results in a weak correlation with human perception and yields reconstructions that lack realism. Diffusion models and video diffusion models have respectively excelled in the fields of image and video generation, particularly achieving remarkable results in terms of image authenticity and realistic perception. However, due to the computational complexity and challenges inherent in adapting diffusion models, there is still uncertainty regarding the potential of video diffusion models in video deblurring tasks. To explore the viability of video diffusion models in the task of video deblurring, we introduce a diffusion model specifically for this purpose. In this field, leveraging highly correlated information between adjacent frames and addressing the challenge of temporal misalignment are crucial research directions. To tackle these challenges, many improvements based on the video diffusion model are introduced in this work. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of detail in the images while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time a diffusion model has been applied in video deblurring to overcome the limitations mentioned above.

Title: Memories of Forgotten Concepts

Authors: Matan Rusanovsky, Shimon Malnick, Amir Jevnisek, Ohad Fried, Shai Avidan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00782
Pdf URL: https://arxiv.org/pdf/2412.00782
Copy Paste: [[2412.00782]] Memories of Forgotten Concepts(https://arxiv.org/abs/2412.00782)
Keywords: diffusion
Abstract: Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.

Title: EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

Authors: Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00784
Pdf URL: https://arxiv.org/pdf/2412.00784
Copy Paste: [[2412.00784]] EDTformer: An Efficient Decoder Transformer for Visual Place Recognition(https://arxiv.org/abs/2412.00784)
Keywords: foundation model
Abstract: Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches typically focus on the aggregation of deep features extracted from a backbone using current prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), giving little attention to the transformer decoder. However, we argue that its strong capability in capturing contextual dependencies and generating accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly generate robust and discriminative global representations for VPR. Specifically, we do this by formulating deep features as the keys and values, as well as a set of independent learnable parameters as the queries. EDTformer can fully utilize the contextual information within deep features, then gradually decode and aggregate the effective features into the learnable queries to form the final global representations. Moreover, to provide powerful deep features for EDTformer and further improve robustness, we use the foundation model DINOv2 as the backbone and propose a Low-Rank Parallel Adaptation (LoPA) method to enhance it, which can refine the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods which add a re-ranking with considerable cost. Code will be available at this https URL.
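
The aggregation step described in the abstract above, with deep features as keys and values and a set of learnable parameters as queries, can be sketched as a single cross-attention step. This is only an illustration under stated assumptions, not the EDTformer architecture: the function names, the single attention head, the projection matrices `w_k` and `w_v`, and the final L2 normalization are hypothetical, and the stacked decoder blocks and two linear layers mentioned in the abstract are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_with_queries(features, queries, w_k, w_v):
    """Aggregate local features into one global descriptor.

    features: (N, D) backbone features, used as keys and values.
    queries:  (Q, D) learnable query vectors (random here, for illustration).
    w_k, w_v: (D, D) key and value projections (hypothetical parameters).
    Returns a flattened, L2-normalized descriptor of shape (Q * D,).
    """
    keys = features @ w_k
    values = features @ w_v
    scale = np.sqrt(queries.shape[-1])
    attn = softmax(queries @ keys.T / scale, axis=-1)   # (Q, N)
    decoded = attn @ values                             # (Q, D)
    desc = decoded.reshape(-1)
    return desc / np.linalg.norm(desc)  # normalization is an assumption

# Usage: 196 local features of width 32, aggregated by 4 queries.
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 32))
q = rng.standard_normal((4, 32))
wk = rng.standard_normal((32, 32)) / np.sqrt(32)
wv = rng.standard_normal((32, 32)) / np.sqrt(32)
descriptor = aggregate_with_queries(feats, q, wk, wv)
```

Because the queries attend over an unordered set of features, the descriptor in this sketch does not depend on the order of the rows in `feats`, and its length is fixed by the number of queries rather than by the number of local features.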

Title: Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach

Authors: Jingyi Zhao, Yuxuan Ou, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI, q-bio.BM, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.00807
Pdf URL: https://arxiv.org/pdf/2412.00807
Copy Paste: [[2412.00807]] Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach(https://arxiv.org/abs/2412.00807)
Keywords: generative
Abstract: Ionizable lipids are essential in developing lipid nanoparticles (LNPs) for effective messenger RNA (mRNA) delivery. While traditional methods for designing new ionizable lipids are typically time-consuming, deep generative models have emerged as a powerful solution, significantly accelerating the molecular discovery process. However, a practical challenge arises as the molecular structures generated can often be difficult or infeasible to synthesize. This project explores Monte Carlo tree search (MCTS)-based generative models for synthesizable ionizable lipids. Leveraging a synthetically accessible lipid building block dataset and two specialized predictors to guide the search through chemical space, we introduce a policy network guided MCTS generative model capable of producing new ionizable lipids with available synthesis pathways.

Title: Categorical Keypoint Positional Embedding for Robust Animal Re-Identification

Authors: Yuhao Lin, Lingqiao Liu, Javen Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00818
Pdf URL: https://arxiv.org/pdf/2412.00818
Copy Paste: [[2412.00818]] Categorical Keypoint Positional Embedding for Robust Animal Re-Identification(https://arxiv.org/abs/2412.00818)
Keywords: diffusion
Abstract: Animal re-identification (ReID) has become an indispensable tool in ecological research, playing a critical role in tracking population dynamics, analyzing behavioral patterns, and assessing ecological impacts, all of which are vital for informed conservation strategies. Unlike human ReID, animal ReID faces significant challenges due to the high variability in animal poses, diverse environmental conditions, and the inability to directly apply pre-trained models to animal data, making the identification process across species more complex. This work introduces an innovative keypoint propagation mechanism, which utilizes a single annotated image and a pre-trained diffusion model to propagate keypoints across an entire dataset, significantly reducing the cost of manual annotation. Additionally, we enhance the Vision Transformer (ViT) by implementing Keypoint Positional Encoding (KPE) and Categorical Keypoint Positional Embedding (CKPE), enabling the ViT to learn more robust and semantically-aware representations. This provides more comprehensive and detailed keypoint representations, leading to more accurate and efficient re-identification. Our extensive experimental evaluations demonstrate that this approach significantly outperforms existing state-of-the-art methods across four wildlife datasets. The code will be publicly released.

Title: Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models

Authors: Christian Möller, Niklas Funk, Jan Peters
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00835
Pdf URL: https://arxiv.org/pdf/2412.00835
Copy Paste: [[2412.00835]] Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models(https://arxiv.org/abs/2412.00835)
Keywords: diffusion, generative
Abstract: Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model solely operates on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates upon an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results demonstrate the competitive performance of our approach on the Linemod dataset and showcase the effectiveness of our design choices. Code is available at this https URL.

Title: AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer

Authors: Jin Lyu, Tianyi Zhu, Yi Gu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang, Liang An
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00837
Pdf URL: https://arxiv.org/pdf/2412.00837
Copy Paste: [[2412.00837]] AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer(https://arxiv.org/abs/2412.00837)
Keywords: diffusion
Abstract: Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species datasets leave this problem underexplored. To this end, this paper presents AniMer, which estimates animal pose and shape using a family-aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family-aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on the out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications. The project page of AniMer is this https URL.

Title: Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion

Authors: Bohai Gu, Hao Luo, Song Guo, Peiran Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00857
Pdf URL: https://arxiv.org/pdf/2412.00857
Copy Paste: [[2412.00857]] Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion(https://arxiv.org/abs/2412.00857)
Keywords: diffusion
Abstract: Recently, diffusion-based methods have achieved great improvements in the video inpainting task. However, these methods still face many challenges, such as maintaining temporal consistency and reducing time cost. This paper proposes an advanced video inpainting framework using optical Flow-guided Efficient Diffusion, called FloED. Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch. Additionally, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. By further introducing a flow attention cache mechanism, FloED efficiently reduces the computational cost brought by incorporating optical flow. Comprehensive experiments in both background restoration and object removal tasks demonstrate that FloED outperforms state-of-the-art methods from the perspective of both performance and efficiency.

Title: Deep evolving semi-supervised anomaly detection

Authors: Jack Belham, Aryan Bhosale, Samrat Mukherjee, Biplab Banerjee, Fabio Cuzzolin
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00860
Pdf URL: https://arxiv.org/pdf/2412.00860
Copy Paste: [[2412.00860]] Deep evolving semi-supervised anomaly detection(https://arxiv.org/abs/2412.00860)
Keywords: generative, anomaly
Abstract: The aim of this paper is to formalise the task of continual semi-supervised anomaly detection (CSAD) and to highlight the importance of a problem formulation that assumes conditions as close to the real world as possible. After an overview of the relevant definitions of continual semi-supervised learning, its components, its anomaly detection extension, and the training protocols, the paper introduces a baseline model, a variational autoencoder (VAE) that works with semi-supervised data, along with a continual learning method of deep generative replay with outlier rejection. The results show that such a use of extreme value theory (EVT) applied to anomaly detection can provide promising results even in comparison to an upper baseline of joint training. The results explore the effects of how much labelled and unlabelled data is present, of which class, and where it is located in the data stream. Outlier rejection shows promising initial results where it often surpasses a baseline method of Elastic Weight Consolidation (EWC). A baseline for CSAD is put forward along with the specific dataset setups used for reproducibility and testability for other practitioners. Future research directions include other CSAD settings and further research into efficient continual hyperparameter tuning.

Title: Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration

Authors: Haoze Sun, Wenbo Li, Jiayue Liu, Kaiwen Zhou, Yongqiang Chen, Yong Guo, Yanwei Li, Renjing Pei, Long Peng, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00878
Pdf URL: https://arxiv.org/pdf/2412.00878
Copy Paste: [[2412.00878]] Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration(https://arxiv.org/abs/2412.00878)
Keywords: diffusion, generative
Abstract: Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter "generative capability deactivation" when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.

Title: Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

Authors: Kun Qian, Tianyu Sun, Wenhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00890
Pdf URL: https://arxiv.org/pdf/2412.00890
Copy Paste: [[2412.00890]] Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection(https://arxiv.org/abs/2412.00890)
Keywords: anomaly
Abstract: Industrial anomaly detection (IAD) plays a crucial role in the maintenance and quality control of manufacturing processes. In this paper, we propose a novel approach, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which leverages large vision-language models (LVLMs) to improve both anomaly detection and localization in industrial settings. CLAD aligns visual and textual features into a shared embedding space using contrastive learning, ensuring that normal instances are grouped together while anomalies are pushed apart. Through extensive experiments on two benchmark industrial datasets, MVTec-AD and VisA, we demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Additionally, we provide ablation studies and human evaluation to validate the importance of key components in our method. Our approach not only achieves superior performance but also enhances interpretability by accurately localizing anomalies, making it a promising solution for real-world industrial applications.

Title: A Deep Generative Model for the Design of Synthesizable Ionizable Lipids

Authors: Yuxuan Ou, Jingyi Zhao, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00928
Pdf URL: https://arxiv.org/pdf/2412.00928
Copy Paste: [[2412.00928]] A Deep Generative Model for the Design of Synthesizable Ionizable Lipids(https://arxiv.org/abs/2412.00928)
Keywords: generative
Abstract: Lipid nanoparticles (LNPs) are vital in modern biomedicine, enabling the effective delivery of mRNA for vaccines and therapies by protecting it from rapid degradation. Among the components of LNPs, ionizable lipids play a key role in RNA protection and facilitate its delivery into the cytoplasm. However, designing ionizable lipids is complex. Deep generative models can accelerate this process and explore a larger candidate space compared to traditional methods. Due to the structural differences between lipids and small molecules, existing generative models used for small molecule generation are unsuitable for lipid generation. To address this, we developed a deep generative model specifically tailored for the discovery of ionizable lipids. Our model generates novel ionizable lipid structures and provides synthesis paths using synthetically accessible building blocks, addressing synthesizability. This advancement holds promise for streamlining the development of lipid-based delivery systems, potentially accelerating the deployment of new therapeutic agents, including mRNA vaccines and gene therapies.

Title: STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Authors: Nicholas Lenzen, Amogh Raut, Andrew Melnik
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00949
Pdf URL: https://arxiv.org/pdf/2412.00949
Copy Paste: [[2412.00949]] STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft(https://arxiv.org/abs/2412.00949)
Keywords: foundation model, generative
Abstract: Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.

Title: WAFFLE: Multimodal Floorplan Understanding in the Wild

Authors: Keren Ganon, Morris Alper, Rachel Mikulinsky, Hadar Averbuch-Elor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00955
Pdf URL: https://arxiv.org/pdf/2412.00955
Copy Paste: [[2412.00955]] WAFFLE: Multimodal Floorplan Understanding in the Wild(https://arxiv.org/abs/2412.00955)
Keywords: foundation model, generative
Abstract: Buildings are a central feature of human culture and are increasingly being analyzed with computational methods. However, recent works on computational building understanding have largely focused on natural imagery of buildings, neglecting the fundamental element defining a building's structure -- its floorplan. Conversely, existing works on floorplan understanding are extremely limited in scope, often focusing on floorplans of a single semantic category and region (e.g. floorplans of apartments from a single country). In this work, we introduce WAFFLE, a novel multimodal floorplan understanding dataset of nearly 20K floorplan images and metadata curated from Internet data spanning diverse building types, locations, and data formats. By using a large language model and multimodal foundation models, we curate and extract semantic information from these images and their accompanying noisy metadata. We show that WAFFLE enables progress on new building understanding tasks, both discriminative and generative, which were not feasible using prior datasets. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.

Title: Competition Dynamics Shape Algorithmic Phases of In-Context Learning

Authors: Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, Hidenori Tanaka
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01003
Pdf URL: https://arxiv.org/pdf/2412.01003
Copy Paste: [[2412.01003]] Competition Dynamics Shape Algorithmic Phases of In-Context Learning(https://arxiv.org/abs/2412.01003)
Keywords: in-context
Abstract: In-Context Learning (ICL) has significantly expanded the general-purpose nature of large language models, allowing them to adapt to novel tasks using merely the inputted context. This has motivated a series of papers that analyze tractable synthetic domains and postulate precise mechanisms that may underlie ICL. However, the use of relatively distinct setups that often lack a sequence modeling nature makes it unclear how general the reported insights from such studies are. Motivated by this, we propose a synthetic sequence modeling task that involves learning to simulate a finite mixture of Markov chains. As we show, models trained on this task reproduce most well-known results on ICL, hence offering a unified setting for studying the concept. Building on this setup, we demonstrate that we can explain a model's behavior by decomposing it into four broad algorithms that combine a fuzzy retrieval vs. inference approach with either unigram or bigram statistics of the context. These algorithms engage in competition dynamics to dominate model behavior, with the precise experimental conditions dictating which algorithm ends up superseding others: e.g., we find merely varying context size or amount of training yields (at times sharp) transitions between which algorithm dictates the model behavior, revealing a mechanism that explains the transient nature of ICL. In this sense, we argue ICL is best thought of as a mixture of different algorithms, each with its own peculiarities, instead of a monolithic capability. This also implies that making general claims about ICL that hold universally across all settings may be infeasible.
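
Note: the data-generating process named in the abstract above (a finite mixture of Markov chains) can be sketched as follows, together with the unigram and bigram context statistics it mentions. This is only an illustration under stated assumptions, not the paper's setup: the number of chains and states, the uniform mixture weights, the uniform initial state, and the helper names `make_chain`, `sample_sequence`, `unigram_counts`, and `bigram_counts` are all hypothetical.

```python
import random


def make_chain(num_states, rng):
    """Return a random row-stochastic transition matrix as nested lists."""
    matrix = []
    for _ in range(num_states):
        weights = [rng.random() for _ in range(num_states)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def sample_sequence(chains, mixture_weights, length, rng):
    """Pick one chain from the mixture, then roll it out for `length` steps."""
    chain_index = rng.choices(range(len(chains)), weights=mixture_weights)[0]
    transition = chains[chain_index]
    num_states = len(transition)
    state = rng.randrange(num_states)  # uniform initial state (an assumption)
    sequence = [state]
    for _ in range(length - 1):
        state = rng.choices(range(num_states), weights=transition[state])[0]
        sequence.append(state)
    return chain_index, sequence


def unigram_counts(sequence, num_states):
    """Count how often each state appears in the context."""
    counts = [0] * num_states
    for state in sequence:
        counts[state] += 1
    return counts


def bigram_counts(sequence, num_states):
    """Count observed transitions (previous state, next state) in the context."""
    counts = [[0] * num_states for _ in range(num_states)]
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    return counts


rng = random.Random(0)
num_states, num_chains, length = 4, 3, 50
chains = [make_chain(num_states, rng) for _ in range(num_chains)]
mixture_weights = [1.0 / num_chains] * num_chains

chain_index, sequence = sample_sequence(chains, mixture_weights, length, rng)
unigrams = unigram_counts(sequence, num_states)
bigrams = bigram_counts(sequence, num_states)
```

A model trained on such sequences never sees `chain_index`; it has to infer the active chain, or fall back on the context statistics, from the sequence alone.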

Title: Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Authors: Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01027
Pdf URL: https://arxiv.org/pdf/2412.01027
Copy Paste: [[2412.01027]] Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation(https://arxiv.org/abs/2412.01027)
Keywords: diffusion, in-context
Abstract: Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (≥19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.

Title: Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings

Authors: Razi Mahmood, Pingkun Yan, Diego Machado Reyes, Ge Wang, Mannudeep K. Kalra, Parisa Kaviani, Joy T. Wu, Tanveer Syeda-Mahmood
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.01031
Pdf URL: https://arxiv.org/pdf/2412.01031
Copy Paste: [[2412.01031]] Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings(https://arxiv.org/abs/2412.01031)
Keywords: generative
Abstract: Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings. We then perform phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present results that compare this evaluation metric with other textual metrics on a gold standard dataset derived from the MIMIC collection and show its robustness and sensitivity to factual errors.

Title: CRISP: Object Pose and Shape Estimation with Test-Time Adaptation

Authors: Jingnan Shi, Rajat Talak, Harry Zhang, David Jin, Luca Carlone
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01052
Pdf URL: https://arxiv.org/pdf/2412.01052
Copy Paste: [[2412.01052]] CRISP: Object Pose and Shape Estimation with Test-Time Adaptation(https://arxiv.org/abs/2412.01052)
Keywords: self-supervised
Abstract: We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time, and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on YCBV, SPE3R, and NOCS datasets. CRISP shows high performance on all the datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also shows an ability to generalize to unseen objects. Code and pre-trained models will be available on this https URL.

Title: FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Authors: Taekyung Ki, Dongchan Min, Gyoungsu Chae
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01064
Pdf URL: https://arxiv.org/pdf/2412.01064
Copy Paste: [[2412.01064]] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait(https://arxiv.org/abs/2412.01064)
Keywords: diffusion, generative
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

Title: DuoCast: Duo-Probabilistic Meteorology-Aware Model for Extended Precipitation Nowcasting

Authors: Penghui Wen, Lei Bai, Mengwei He, Patrick Filippi, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01091
Pdf URL: https://arxiv.org/pdf/2412.01091
Copy Paste: [[2412.01091]] DuoCast: Duo-Probabilistic Meteorology-Aware Model for Extended Precipitation Nowcasting(https://arxiv.org/abs/2412.01091)
Keywords: diffusion
Abstract: Extended short-term precipitation nowcasting struggles with decreasing precision because of insufficient consideration of meteorological knowledge, such as weather fronts, which significantly influence precipitation intensity, duration, and spatial distribution. Therefore, in this paper, we present DuoCast, a novel dual-probabilistic meteorology-aware model designed to address both broad weather evolution and micro-scale fluctuations using two diffusion models, PrecipFlow and MicroDynamic, respectively. Our PrecipFlow model captures evolution trends through an Extreme Precipitation-Aware Encoder (EPA-Encoder), which includes AirConvolution and FrontAttention blocks to process two levels of precipitation data: general and extreme. The output conditions a UNet-based diffusion to produce prediction maps enriched with weather front information. The MicroDynamic model further refines the results to capture micro-scale variability. Extensive experiments on four public benchmarks demonstrate the effectiveness of DuoCast, achieving superior performance over state-of-the-art methods. Our code is available at this https URL.

Title: One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Authors: Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.01106
Pdf URL: https://arxiv.org/pdf/2412.01106
Copy Paste: [[2412.01106]] One Shot, One Talk: Whole-body Talking Avatar from a Single Image(https://arxiv.org/abs/2412.01106)
Keywords: diffusion
Abstract: Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.

Title: Multi-Scale Representation Learning for Protein Fitness Prediction

Authors: Zuobai Zhang, Pascal Notin, Yining Huang, Aurélie Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2412.01108
Pdf URL: https://arxiv.org/pdf/2412.01108
Copy Paste: [[2412.01108]] Multi-Scale Representation Learning for Protein Fitness Prediction(https://arxiv.org/abs/2412.01108)
Keywords: self-supervised
Abstract: Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet-lab experiments, previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies solely focused on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence-structure models have so far achieved only incremental improvements when compared to the leading sequence-only approaches, highlighting unresolved challenges in effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence-Structure-Surface Fitness (S3F) model - a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state-of-the-art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Our code is at this https URL.

Title: DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

Authors: Hao Wu, Zhihang Zhong, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01115
Pdf URL: https://arxiv.org/pdf/2412.01115
Copy Paste: [[2412.01115]] DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding(https://arxiv.org/abs/2412.01115)
Keywords: diffusion
Abstract: Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.

Title: Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM

Authors: Alejandro Fontan, Javier Civera, Tobias Fischer, Michael Milford
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01116
Pdf URL: https://arxiv.org/pdf/2412.01116
Copy Paste: [[2412.01116]] Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM(https://arxiv.org/abs/2412.01116)
Keywords: self-supervised, generative
Abstract: Evaluation is critical to both developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems, but is universally reliant on high-quality geometric ground truth -- a resource that is not only costly and time-intensive but, in many cases, entirely unobtainable. This dependency on ground truth restricts SfM and SLAM applications across diverse environments and limits scalability to real-world scenarios. In this work, we propose a novel ground-truth-free (GTF) evaluation methodology that eliminates the need for geometric ground truth, instead using sensitivity estimation via sampling from both original and noisy versions of input images. Our approach shows strong correlation with traditional ground-truth-based benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up new opportunities to leverage a much larger number of dataset sources, and for self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to what has occurred in generative AI.

Title: LoyalDiffusion: A Diffusion Model Guarding Against Data Replication

Authors: Chenghao Li, Yuke Zhang, Dake Chen, Jingqi Xu, Peter A. Beerel
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01118
Pdf URL: https://arxiv.org/pdf/2412.01118
Copy Paste: [[2412.01118]] LoyalDiffusion: A Diffusion Model Guarding Against Data Replication(https://arxiv.org/abs/2412.01118)
Keywords: diffusion
Abstract: Diffusion models have demonstrated significant potential in image generation. However, their ability to replicate training data presents a privacy risk, particularly when the training data includes confidential information. Existing mitigation strategies primarily focus on augmenting the training dataset, leaving the impact of diffusion model architecture underexplored. In this paper, we address this gap by examining and mitigating the impact of the model structure, specifically the skip connections in the diffusion model's U-Net model. We first present our observation on a trade-off in the skip connections. While they enhance image generation quality, they also reinforce the memorization of training data, increasing the risk of replication. To address this, we propose a replication-aware U-Net (RAU-Net) architecture that incorporates information transfer blocks into skip connections that are less essential for image quality. Recognizing the potential impact of RAU-Net on generation quality, we further investigate and identify specific timesteps during which the impact on memorization is most pronounced. By applying RAU-Net selectively at these critical timesteps, we couple our novel diffusion model with a targeted training and inference strategy, forming a framework we refer to as LoyalDiffusion. Extensive experiments demonstrate that LoyalDiffusion outperforms the state-of-the-art replication mitigation method, achieving a 48.63% reduction in replication while maintaining comparable image quality.

Title: Referring Video Object Segmentation via Language-aligned Track Selection

Authors: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01136
Pdf URL: https://arxiv.org/pdf/2412.01136
Copy Paste: [[2412.01136]] Referring Video Object Segmentation via Language-aligned Track Selection(https://arxiv.org/abs/2412.01136)
Keywords: foundation model
Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at this https URL

Title: TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Authors: Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01137
Pdf URL: https://arxiv.org/pdf/2412.01137
Copy Paste: [[2412.01137]] TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition(https://arxiv.org/abs/2412.01137)
Keywords: diffusion
Abstract: Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information to create a text region, which is less complex than the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR scales well, and we construct an anagram-based TextSSR-F dataset of 0.4 million complex and realistic text instances. Experiments show that models trained with added TextSSR-F data exhibit better accuracy than models trained on 4 million existing synthetic data samples. Moreover, its accuracy margin to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential in scene text image synthesis. Our code is available at this https URL.

Title: R.I.P.: A Simple Black-box Attack on Continual Test-time Adaptation

Authors: Trung-Hieu Hoang, Duc Minh Vo, Minh N. Do
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01154
Pdf URL: https://arxiv.org/pdf/2412.01154
Copy Paste: [[2412.01154]] R.I.P.: A Simple Black-box Attack on Continual Test-time Adaptation(https://arxiv.org/abs/2412.01154)
Keywords: self-supervised
Abstract: Test-time adaptation (TTA) has emerged as a promising solution to tackle the continual domain shift in machine learning by allowing model parameters to change at test time, via self-supervised learning on unlabeled testing data. At the same time, it unfortunately opens the door to unforeseen vulnerabilities that can cause degradation over time. Through a simple theoretical continual TTA model, we successfully identify a risk in the sampling process of testing data that could easily degrade the performance of a continual TTA model. We name this risk Reusing of Incorrect Prediction (RIP); it can be exploited by TTA attackers or arise from unintended queries by general TTA users. The risk posed by RIP is also highly realistic, as it does not require prior knowledge of model parameters or modification of testing samples. This simple requirement makes RIP the first black-box TTA attack algorithm, setting it apart from existing white-box attempts. We extensively benchmark the performance of the most recent continual TTA approaches when facing the RIP attack, providing insights on its success, and laying out potential roadmaps that could enhance the resilience of future continual TTA systems.

Title: Graph Community Augmentation with GMM-based Modeling in Latent Space

Authors: Shintaro Fukushima, Kenji Yamanishi
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01163
Pdf URL: https://arxiv.org/pdf/2412.01163
Copy Paste: [[2412.01163]] Graph Community Augmentation with GMM-based Modeling in Latent Space(https://arxiv.org/abs/2412.01163)
Keywords: generative
Abstract: This study addresses the issue of graph generation with generative models. In particular, we are concerned with the graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. Graph community augmentation means that the generated graphs have a new community. There is a chance of discovering an unseen but important structure of graphs with a new community, for example, in a social network such as a purchaser network. Graph community augmentation may also be helpful for generalization of data mining models in cases where it is difficult to collect enough real graph data. In fact, there are many ways to generate a new community in an existing graph. It is desirable to discover a new graph with a new community beyond the given graph while keeping the structure of the original graphs to some extent, so that the generated graphs are realistic. To this end, we propose an algorithm called graph community augmentation (GCA). The key ideas of GCA are (i) to fit a Gaussian mixture model (GMM) to data points in the latent space into which the nodes in the original graph are embedded, and (ii) to add data points in the new cluster in the latent space for generating a new community based on the minimum description length (MDL) principle. We empirically demonstrate the effectiveness of GCA for generating graphs with a new community structure on synthetic and real datasets.

Title: Rectified Flow For Structure Based Drug Design

Authors: Daiheng Zhang, Chengyue Gong, Qiang Liu
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2412.01174
Pdf URL: https://arxiv.org/pdf/2412.01174
Copy Paste: [[2412.01174]] Rectified Flow For Structure Based Drug Design(https://arxiv.org/abs/2412.01174)
Keywords: diffusion, generative
Abstract: Deep generative models have achieved tremendous success in structure-based drug design in recent years, especially for generating 3D ligand molecules that bind to a specific protein pocket. Notably, diffusion models have transformed ligand generation by providing exceptional quality and creativity. However, traditional diffusion models are restricted by their conventional learning objectives, which limit their broader applicability. In this work, we propose a new framework, FlowSBDD, which is based on the rectified flow model and allows us to flexibly incorporate additional losses to optimize specific targets and to introduce additional conditions, either as an extra input condition or by replacing the initial Gaussian distribution. Extensive experiments on CrossDocked2020 show that our approach can achieve state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties without specifically designing the binding site, with up to -8.50 Avg. Vina Dock score and 75.0% Diversity.

Title: OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?

Authors: Zijian Chen, Tingzhu Chen, Wenjun Zhang, Guangtao Zhai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01175
Pdf URL: https://arxiv.org/pdf/2412.01175
Copy Paste: [[2412.01175]] OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?(https://arxiv.org/abs/2412.01175)
Keywords: foundation model
Abstract: We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscriptions (OBI) processing tasks demanding expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected diverse-sourced images, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, comprising multi-stage font appearances from excavation to synthesis, such as original oracle bone, inked rubbings, oracle bone fragments, cropped single character, and handprinted character. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts. The evaluation of 6 proprietary LMMs as well as 17 open-source LMMs highlights the substantial challenges and demands posed by OBI-Bench. Even the latest versions of GPT-4o, Gemini 1.5 Pro, and Qwen-VL-Max are still far from public-level humans in some fine-grained perception tasks. However, they perform at a level comparable to untrained humans in the deciphering task, indicating remarkable capabilities in offering new interpretative perspectives and generating creative guesses. We hope OBI-Bench can help the community develop domain-specific multi-modal foundation models for ancient language research and delve deeper to discover and enhance these untapped potentials of LMMs.

Title: MeasureNet: Measurement Based Celiac Disease Identification

Authors: Aayush Kumar Tyagi, Vaibhav Mishra, Ashok Tiwari, Lalita Mehra, Prasenjit Das, Govind Makharia, Prathosh AP, Mausam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01182
Pdf URL: https://arxiv.org/pdf/2412.01182
Copy Paste: [[2412.01182]] MeasureNet: Measurement Based Celiac Disease Identification(https://arxiv.org/abs/2412.01182)
Keywords: generative
Abstract: Celiac disease is an autoimmune disorder triggered by the consumption of gluten. It causes damage to the villi, the finger-like projections in the small intestine that are responsible for nutrient absorption. Additionally, the crypts, which form the base of the villi, are also affected, impairing the regenerative process. The deterioration in villi length, computed as the villi-to-crypt length ratio, indicates the severity of celiac disease. However, manual measurement of villi-crypt length can be both time-consuming and susceptible to inter-observer variability, leading to inconsistencies in diagnosis. While some methods can perform measurement as a post-hoc process, they are prone to errors in the initial stages. This gap underscores the need for pathologically driven solutions that enhance measurement accuracy and reduce human error in celiac disease assessments. Our proposed method, MeasureNet, is a pathologically driven polyline detection framework incorporating polyline localization and object-driven losses specifically designed for measurement tasks. Furthermore, we leverage a segmentation model to provide auxiliary guidance about crypt location when crypts are partially visible. To ensure that the model is not overdependent on the segmentation mask, we enhance model robustness through a mask feature mixup technique. Additionally, we introduce a novel dataset for grading celiac disease, consisting of 750 annotated duodenum biopsy images. MeasureNet achieves an 82.66% classification accuracy for binary classification and 81% accuracy for multi-class grading of celiac disease. Code: this https URL
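As an illustration of the villi-to-crypt length ratio mentioned in the abstract, the following is a minimal sketch that assumes each villus and crypt is given as an ordered list of (x, y) points. The function names and the sample coordinates are hypothetical and are not taken from the MeasureNet code.

```python
import math


def polyline_length(points):
    """Sum of Euclidean segment lengths along an ordered list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def villi_to_crypt_ratio(villus_polyline, crypt_polyline):
    """Ratio of villus length to crypt length (hypothetical helper)."""
    crypt_len = polyline_length(crypt_polyline)
    if crypt_len == 0:
        raise ValueError("crypt polyline has zero length")
    return polyline_length(villus_polyline) / crypt_len


# Made-up coordinates, in arbitrary units.
villus = [(0.0, 0.0), (0.0, 3.0), (4.0, 6.0)]  # segments of length 3 and 5
crypt = [(0.0, 0.0), (0.0, -4.0)]              # one segment of length 4
ratio = villi_to_crypt_ratio(villus, crypt)    # 8 / 4 = 2.0
```

How such a ratio maps to a disease grade is not stated in the abstract, so no thresholds are assumed here.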

Title: MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry

Authors: Kurukulasooriya Fernando and Gianluca Demartini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01189
Pdf URL: https://arxiv.org/pdf/2412.01189
Copy Paste: [[2412.01189]] MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry(https://arxiv.org/abs/2412.01189)
Keywords: generative
Abstract: Recent generative LLMs (Large Language Models) have exhibited human-like language capabilities but have shown a lack of domain-specific understanding. Therefore, the research community has started the development of domain-specific LLMs for many domains. In this work we focus on discussing how to build mining domain-specific LLMs, as the global mining industry contributes significantly to the worldwide economy. We report on MiningGPT, a mining domain-specific instruction-following 7B-parameter LLM that achieved a 14% higher mining domain knowledge test score than its parent model, Mistral 7B Instruct.

Title: TinyFusion: Diffusion Transformers Learned Shallow

Authors: Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01199
Pdf URL: https://arxiv.org/pdf/2412.01199
Copy Paste: [[2412.01199]] TinyFusion: Diffusion Transformers Learned Shallow(https://arxiv.org/abs/2412.01199)
Keywords: diffusion
Abstract: Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at this https URL.

Title: Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Authors: Wenxin Su, Song Tang, Xiaofeng Liu, Xiaojing Yi, Mao Ye, Chunxiao Zu, Jiahao Li, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01203
Pdf URL: https://arxiv.org/pdf/2412.01203
Copy Paste: [[2412.01203]] Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data(https://arxiv.org/abs/2412.01203)
Keywords: generative
Abstract: Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way--learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as pseudo-perturbation labels, because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling the identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with a small batch size.

Title: PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

Authors: Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01223
Pdf URL: https://arxiv.org/pdf/2412.01223
Copy Paste: [[2412.01223]] PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control(https://arxiv.org/abs/2412.01223)
Keywords: diffusion
Abstract: Recently, diffusion models have exhibited superior performance in the area of image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically encounter problems in terms of semantic consistency between images and text, and the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that highly aligns with the user input prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesign the MASK generation algorithm for the training and testing datasets to simulate the user's habit of applying MASK, and introduce a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.

Title: Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Authors: Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Huchuan Lu, Jinsong Ouyang, Georges El Fakhri, Xiaofeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01240
Pdf URL: https://arxiv.org/pdf/2412.01240
Copy Paste: [[2412.01240]] Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes(https://arxiv.org/abs/2412.01240)
Keywords: in-context
Abstract: As a foundational model, SAM has significantly influenced multiple fields within computer vision, and its upgraded version, SAM 2, enhances capabilities in video segmentation, poised to make a substantial impact once again. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, product defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough quantitative evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks. This work aims to provide valuable insights to guide future research in both context-independent and context-dependent concept segmentation, potentially informing the development of the next version - SAM 3.

Title: Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

Authors: Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, Guo-Jun Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01243
Pdf URL: https://arxiv.org/pdf/2412.01243
Copy Paste: [[2412.01243]] Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation(https://arxiv.org/abs/2412.01243)
Keywords: diffusion
Abstract: Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning, aiming to maximize a reward that discounts the final image quality by the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
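The abstract describes a reward that discounts final image quality by the number of denoising steps but does not give its exact form. The sketch below assumes a geometric discount, which is only one plausible choice; `discounted_reward` and `gamma` are hypothetical names, not taken from the paper.

```python
def discounted_reward(image_quality, num_steps, gamma=0.97):
    """Scale a final quality score by gamma ** num_steps (assumed form).

    image_quality: scalar score for the final image, higher is better.
    num_steps: number of denoising steps used to produce the image.
    gamma: hypothetical discount factor in (0, 1].
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return (gamma ** num_steps) * image_quality


# With equal quality, the shorter schedule earns the higher reward.
short_schedule = discounted_reward(5.0, num_steps=10)
long_schedule = discounted_reward(5.0, num_steps=20)
```

Under this assumed form, a longer schedule is only worthwhile when it improves quality by more than the discount it incurs.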

Title: Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization

Authors: Lingyun Zhang, Yu Xie, Yanwei Fu, Ping Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01244
Pdf URL: https://arxiv.org/pdf/2412.01244
Copy Paste: [[2412.01244]] Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization(https://arxiv.org/abs/2412.01244)
Keywords: diffusion
Abstract: As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacing in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.

Title: Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective

Authors: Jinouwen Zhang, Rongkun Xue, Yazhe Niu, Yun Chen, Jing Yang, Hongsheng Li, Yu Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01245
Pdf URL: https://arxiv.org/pdf/2412.01245
Copy Paste: [[2412.01245]] Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective(https://arxiv.org/abs/2412.01245)
Keywords: diffusion, generative
Abstract: Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.
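For readers unfamiliar with advantage-weighted regression, which the abstract names as the basis of GMPO, the following is a generic sketch of the standard weighting: each sample's log-likelihood is weighted by exp(advantage / beta). It is not the GenerativeRL implementation; the function names, the temperature `beta`, and the clipping value `max_weight` are assumptions for illustration.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated advantages, clipped for numerical stability."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0):
    """Negative advantage-weighted mean log-likelihood over a batch.

    log_probs: log pi(a | s) for each sampled (state, action) pair.
    advantages: estimated advantage of each sampled action.
    """
    weights = awr_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# With zero advantages every weight is 1, so the loss reduces to the
# plain negative mean log-likelihood.
loss = awr_loss([-1.0, -2.0], [0.0, 0.0])
```

Actions with higher estimated advantage receive larger weights, so minimizing this loss pushes the policy toward them while remaining a supervised, likelihood-style objective.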

Title: EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Authors: Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01254
Pdf URL: https://arxiv.org/pdf/2412.01254
Copy Paste: [[2412.01254]] EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation(https://arxiv.org/abs/2412.01254)
Keywords: diffusion
Abstract: This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike the conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation to filter out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.

Title: NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

Authors: Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01256
Pdf URL: https://arxiv.org/pdf/2412.01256
Copy Paste: [[2412.01256]] NLPrompt: Noise-Label Prompt Learning for Vision-Language Models(https://arxiv.org/abs/2412.01256)
Keywords: foundation model
Abstract: The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representation and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
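
Illustrative sketch (not from the paper): the abstract contrasts MAE loss with cross-entropy and applies cross-entropy to the clean subset and MAE to the noisy subset. The snippet below shows why MAE is less sensitive to a wrong label: for a one-hot target it equals 2 * (1 - p_label) and is therefore bounded by 2, whereas cross-entropy grows without bound. The function names and the `is_clean` flag are hypothetical; how the paper's PromptOT produces the clean/noisy split is not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability of the labeled class (unbounded above)."""
    return -math.log(probs[label])

def mae(probs, label):
    """Sum of |p_i - y_i| for one-hot y; equals 2 * (1 - p_label) <= 2."""
    return sum(abs(p - (1.0 if i == label else 0.0))
               for i, p in enumerate(probs))

def mixed_loss(probs, label, is_clean):
    """Cross-entropy for samples flagged clean, MAE for samples flagged noisy."""
    return cross_entropy(probs, label) if is_clean else mae(probs, label)

# A confident prediction for class 0 paired with a (possibly wrong) label 1.
probs = softmax([8.0, 0.0, 0.0])
ce_value = cross_entropy(probs, 1)   # large penalty
mae_value = mae(probs, 1)            # stays below 2
```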

Title: Indexing Economic Fluctuation Narratives from Keiki Watchers Survey

Authors: Eriko Shigetsugu, Hiroki Sakaji, Itsuki Noda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01265
Pdf URL: https://arxiv.org/pdf/2412.01265
Copy Paste: [[2412.01265]] Indexing Economic Fluctuation Narratives from Keiki Watchers Survey(https://arxiv.org/abs/2412.01265)
Keywords: diffusion
Abstract: In this paper, we design indices of economic fluctuation narratives derived from economic surveys. Companies, governments, and investors rely on key metrics like GDP and industrial production indices to predict economic trends. However, they have yet to effectively leverage the wealth of information contained in economic text, such as causal relationships, in their economic forecasting. Therefore, we design indices of economic fluctuation from economic surveys by using our previously proposed narrative framework. From the evaluation results, it is observed that the proposed indices had a stronger correlation with cumulative lagging diffusion index than other types of diffusion indices.

Title: MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Authors: Sen Xing, Muyan Zhong, Zeqiang Lai, Liangchen Li, Jiawen Liu, Yaohui Wang, Jifeng Dai, Wenhai Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01271
Pdf URL: https://arxiv.org/pdf/2412.01271
Copy Paste: [[2412.01271]] MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost(https://arxiv.org/abs/2412.01271)
Keywords: diffusion
Abstract: In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan, Multi-Language adapter, a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance. Achieving comparable generation capabilities in over 110 languages with CLIP similarity scores nearly matching those in English (38.61 for English vs. 37.61 for other languages); and (3) Broad applicability. Seamlessly integrating with compatible community tools like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.

Title: MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

Authors: Shan Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01284
Pdf URL: https://arxiv.org/pdf/2412.01284
Copy Paste: [[2412.01284]] MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model(https://arxiv.org/abs/2412.01284)
Keywords: diffusion
Abstract: Text-to-image generation models have become transformative tools. However, diffusion-based vision language models still lack the ability to precisely control the shape, appearance, and positional placement of objects in generated images using text guidance alone. Global image editing models typically achieve global layout control by relying on additional masks or images as guidance, which often require model training. Although local object-editing models enable modification of object shapes, they do not provide control over the positional placement of these objects. To address these limitations, we propose the MFTF model, which enables precise control over object positioning without requiring additional masks or images. The MFTF model supports both single-object and multi-object positional control (such as translation, rotation, etc.) and allows for concurrent layout control and object semantic editing. This is achieved by controlling the denoising process of the diffusion model through parallel denoising. Attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from the self-attention layers to isolate objects. These queries are then modified according to layout control parameters and injected back into the self-attention layers of the target diffusion model to enable precise positional control.

Title: Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Authors: Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01316
Pdf URL: https://arxiv.org/pdf/2412.01316
Copy Paste: [[2412.01316]] Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation(https://arxiv.org/abs/2412.01316)
Keywords: diffusion
Abstract: We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are available on the project page.
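
Illustrative sketch (not from the paper): the abstract describes Segmented Cross-Attention as splitting hidden states along the temporal dimension so that each segment cross-attends to its own sub-caption. The toy code below shows only that routing idea, using single-head dot-product attention with identity projections and one feature vector per frame. All names, shapes, and the equal-length segment split are assumptions for illustration; the actual model operates on DiT hidden states with learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, context):
    """Dot-product attention of each query over the context tokens.

    Keys and values are the context tokens themselves (no learned weights).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(dim)])
    return out

def segmented_cross_attention(frames, sub_captions):
    """Route each temporal segment of frames to its own sub-caption.

    frames: per-frame feature vectors in temporal order.
    sub_captions: one list of token embeddings per segment.
    """
    seg_len = math.ceil(len(frames) / len(sub_captions))
    out = []
    for i, caption in enumerate(sub_captions):
        segment = frames[i * seg_len:(i + 1) * seg_len]
        if segment:  # trailing segments can be empty for short inputs
            out.extend(cross_attend(segment, caption))
    return out

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
captions = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]]]
attended = segmented_cross_attention(frames, captions)
```

Because the routing only changes which context each segment sees, it introduces no new parameters, which is consistent with the abstract's statement that SCA requires none.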

Title: Negative Token Merging: Image-based Adversarial Feature Guidance

Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01339
Pdf URL: https://arxiv.org/pdf/2412.01339
Copy Paste: [[2412.01339]] Negative Token Merging: Image-based Adversarial Feature Guidance(https://arxiv.org/abs/2412.01339)
Keywords: diffusion
Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe can be implemented in just a few lines of code, incurs only marginally higher (<4%) inference time, and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is publicly available.
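
Illustrative sketch (not from the paper): the abstract says NegToMe pushes apart matching semantic features between a reference and the output during reverse diffusion. One simple way to realize that idea is shown below: each output token is matched to its most similar reference token by cosine similarity and, if the match is strong enough, extrapolated away from it. The function names, the linear extrapolation rule, and the `alpha` and `threshold` values are assumptions for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def push_apart(tokens, ref_tokens, alpha=0.1, threshold=0.65):
    """Move each token away from its best-matching reference token.

    Tokens whose best match falls below the similarity threshold are
    left unchanged, so only semantically matching features are affected.
    """
    out = []
    for tok in tokens:
        sims = [cosine(tok, r) for r in ref_tokens]
        best = max(range(len(ref_tokens)), key=sims.__getitem__)
        if sims[best] >= threshold:
            ref = ref_tokens[best]
            tok = [x + alpha * (x - r) for x, r in zip(tok, ref)]
        out.append(tok)
    return out

tokens = [[1.0, 0.0], [0.0, 1.0]]
reference = [[0.9, 0.1]]
pushed = push_apart(tokens, reference, alpha=0.5, threshold=0.65)
```

In this toy run only the first token matches the reference closely enough to be moved; the second is returned unchanged.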

Title: MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Authors: Xiaomin Li, Xu Jia, Qinghe Wang, Haiwen Diao, Mengmeng Ge, Pengxiang Li, You He, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01343
Pdf URL: https://arxiv.org/pdf/2412.01343
Copy Paste: [[2412.01343]] MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models(https://arxiv.org/abs/2412.01343)
Keywords: diffusion
Abstract: Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt the appearance prior from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from single or multiple reference videos, performing favorably against existing methods in customized video generation.

Title: An overview of diffusion models for generative artificial intelligence

Authors: Davide Gallon, Arnulf Jentzen, Philippe von Wurstemberger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01371
Pdf URL: https://arxiv.org/pdf/2412.01371
Copy Paste: [[2412.01371]] An overview of diffusion models for generative artificial intelligence(https://arxiv.org/abs/2412.01371)
Keywords: diffusion, generative
Abstract: This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.
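
Illustrative sketch (not from the article): in the standard DDPM formulation, the forward process adds Gaussian noise step by step, and x_t can be sampled directly from x_0 as x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the running product of (1 - beta_s). The code below implements that closed form in plain Python. The linear schedule from 1e-4 to 0.02 over 1000 steps is a commonly used choice, assumed here for illustration; the function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """abar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t given x_0 in closed form; also return the noise used."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
          for x, e in zip(x0, eps)]
    return xt, eps

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [1.0, -0.5, 0.25]
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

Training a DDPM typically amounts to regressing a network's output onto `eps` given `x_t` and `t`; as abar_t shrinks toward zero at large t, x_t approaches pure Gaussian noise.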

Title: Hierarchical VAE with a Diffusion-based VampPrior

Authors: Anna Kuzina, Jakub M. Tomczak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01373
Pdf URL: https://arxiv.org/pdf/2412.01373
Copy Paste: [[2412.01373]] Hierarchical VAE with a Diffusion-based VampPrior(https://arxiv.org/abs/2412.01373)
Keywords: diffusion, generative
Abstract: Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce Hierarchical VAE with Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach allows us to achieve better performance compared to the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR10) and demonstrate improved training stability and latent space utilization.

Title: Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data

Authors: Ivan DeAndres-Tame, Ruben Tolosana, Pietro Melzi, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Luis F. Gomez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zhizhou Zhong, Yuge Huang, Yuxi Mi, Shouhong Ding, Shuigeng Zhou, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Zhihong Xiao, Evgeny Smirnov, Anton Pimenov, Aleksei Grigorev, Denis Timoshenko, Kaleb Mesfin Asfaw, Cheng Yaw Low, Hao Liu, Chuyi Wang, Qing Zuo, Zhixiang He, Hatef Otroshi Shahreza, Anjith George, Alexander Unnervik, Parsa Rahimi, Sébastien Marcel, Pedro C. Neto, Marco Huber, Jan Niklas Kolf, Naser Damer, Fadi Boutros, Jaime S. Cardoso, Ana F. Sequeira, Andrea Atzori, Gianni Fenu, Mirko Marras, Vitomir Štruc, Jiang Yu, Zhangjie Li, Jichun Li, Weisong Zhao, Zhen Lei, Xiangyu Zhu, Xiao-Yu Zhang, Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01383
Pdf URL: https://arxiv.org/pdf/2412.01383
Copy Paste: [[2412.01383]] Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data(https://arxiv.org/abs/2412.01383)
Keywords: generative
Abstract: Synthetic data is gaining increasing popularity for face recognition technologies, mainly due to the privacy concerns and challenges associated with obtaining real data, including diverse scenarios, quality, and demographic groups, among others. It also offers some advantages over real data, such as the large amount of data that can be generated or the ability to customize it to adapt to specific problem-solving needs. To effectively use such data, face recognition models should also be specifically designed to exploit synthetic data to its fullest potential. In order to promote the proposal of novel Generative AI methods and synthetic data, and investigate the application of synthetic data to better train face recognition systems, we introduce the 2nd FRCSyn-onGoing challenge, based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. This is an ongoing challenge that provides researchers with an accessible platform to benchmark i) the proposal of novel Generative AI methods and synthetic data, and ii) novel face recognition systems that are specifically proposed to take advantage of synthetic data. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition such as demographic bias, domain adaptation, and performance constraints in demanding situations, such as age disparities between training and testing, changes in the pose, or occlusions. Very interesting findings are obtained in this second edition, including a direct comparison with the first one, in which synthetic databases were restricted to DCFace and GANDiffFace.

Title: Machine Learning Analysis of Anomalous Diffusion

Authors: Wenjie Cai, Yi Hu, Xiang Qu, Hui Zhao, Gongyi Wang, Jing Li, Zihan Huang
Subjects: cs.LG, cond-mat.soft, physics.bio-ph, physics.data-an
Abstract URL: https://arxiv.org/abs/2412.01393
Pdf URL: https://arxiv.org/pdf/2412.01393
Copy Paste: [[2412.01393]] Machine Learning Analysis of Anomalous Diffusion(https://arxiv.org/abs/2412.01393)
Keywords: diffusion
Abstract: The rapid advancements in machine learning have made its application to anomalous diffusion analysis both essential and inevitable. This review systematically introduces the integration of machine learning techniques for enhanced analysis of anomalous diffusion, focusing on two pivotal aspects: single trajectory characterization via machine learning and representation learning of anomalous diffusion. We extensively compare various machine learning methods, including both classical machine learning and deep learning, used for the inference of diffusion parameters and trajectory segmentation. Additionally, platforms such as the Anomalous Diffusion Challenge that serve as benchmarks for evaluating these methods are highlighted. On the other hand, we outline three primary strategies for representing anomalous diffusion: the combination of predefined features, the feature vector from the penultimate layer of neural network, and the latent representation from the autoencoder, analyzing their applicability across various scenarios. This investigation paves the way for future research, offering valuable perspectives that can further enrich the study of anomalous diffusion and advance the application of artificial intelligence in statistical physics and biophysics.

Title: HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Authors: Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01407
Pdf URL: https://arxiv.org/pdf/2412.01407
Copy Paste: [[2412.01407]] HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving(https://arxiv.org/abs/2412.01407)
Keywords: generative
Abstract: Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple kinds of input modality, usually cameras and LiDARs, where they contain complementary information for generation, while existing generation methods ignore this crucial feature, resulting in the generated results only covering separate 2D or 3D information. In order to fill the gap in 2D-3D multi-modal joint generation for autonomous driving, in this paper, we propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projecting from image space to BEV space, then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate our method leads to significant performance gains over SOTA methods in terms of generation metrics.

Title: FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Authors: Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01427
Pdf URL: https://arxiv.org/pdf/2412.01427
Copy Paste: [[2412.01427]] FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration(https://arxiv.org/abs/2412.01427)
Keywords: diffusion, foundation model
Abstract: Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: larger-scale real-world samples and more diverse degradation types. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.

Title: CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01429
Pdf URL: https://arxiv.org/pdf/2412.01429
Copy Paste: [[2412.01429]] CPA: Camera-pose-awareness Diffusion Transformer for Video Generation(https://arxiv.org/abs/2412.01429)
Keywords: diffusion
Abstract: Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap with controllable camera pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.

Title: DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model

Authors: Zhixiang Wang, Guangnan Ye, Xiaosen Wang, Siheng Chen, Zhibo Wang, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01440
Pdf URL: https://arxiv.org/pdf/2412.01440
Copy Paste: [[2412.01440]] DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model(https://arxiv.org/abs/2412.01440)
Keywords: diffusion, generative
Abstract: Physical adversarial patches printed on clothing can easily allow individuals to evade person detectors. However, most existing adversarial patch generation methods prioritize attack effectiveness over stealthiness, resulting in patches that are aesthetically unpleasing. Although existing methods using generative adversarial networks or diffusion models can produce more natural-looking patches, they often struggle to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these challenges, we propose a novel diffusion-based customizable patch generation framework termed DiffPatch, specifically tailored for creating naturalistic and customizable adversarial patches. Our approach enables users to utilize a reference image as the source, rather than starting from random noise, and incorporates masks to craft naturalistic patches of various shapes, not limited to squares. To prevent the original semantics from being lost during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Notably, while maintaining a natural appearance, our method achieves a comparable attack performance to state-of-the-art non-naturalistic patches when using similarly sized attacks. Using DiffPatch, we have created a physical adversarial T-shirt dataset, AdvPatch-1K, specifically targeting YOLOv5s. This dataset includes over a thousand images across diverse scenarios, validating the effectiveness of our attack in real-world environments. Moreover, it provides a valuable resource for future research.

Title: RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications

Authors: Nicholas Konz, Yuwen Chen, Hanxue Gu, Haoyu Dong, Yaqian Chen, Maciej A. Mazurowski
Subjects: cs.CV, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01496
Pdf URL: https://arxiv.org/pdf/2412.01496
Copy Paste: [[2412.01496]] RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications(https://arxiv.org/abs/2412.01496)
Keywords: generative
Abstract: Determining whether two sets of images belong to the same or different domain is a crucial task in modern medical image analysis and deep learning, where domain shift is a common problem that often results in decreased model performance. This determination is also important to evaluate the output quality of generative models, e.g., image-to-image translation models used to mitigate domain shift. Current metrics for this either rely on the (potentially biased) choice of some downstream task such as segmentation, or adopt task-independent perceptual metrics (e.g., FID) from natural imaging which insufficiently capture anatomical consistency and realism in medical images. We introduce a new perceptual metric tailored for medical images: Radiomic Feature Distance (RaD), which utilizes standardized, clinically meaningful and interpretable image features. We show that RaD is superior to other metrics for out-of-domain (OOD) detection in a variety of experiments. Furthermore, RaD outperforms previous perceptual metrics (FID, KID, etc.) for image-to-image translation by correlating more strongly with downstream task performance as well as anatomical consistency and realism, and shows similar utility for evaluating unconditional image generation. RaD also offers additional benefits such as interpretability, as well as stability and computational efficiency at low sample sizes. Our results are supported by broad experiments spanning four multi-domain medical image datasets, nine downstream tasks, six image translation models, and other factors, highlighting the broad potential of RaD for medical image analysis.

Title: Structured 3D Latents for Scalable and Versatile 3D Generation

Authors: Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01506
Pdf URL: https://arxiv.org/pdf/2412.01506
Copy Paste: [[2412.01506]] Structured 3D Latents for Scalable and Versatile 3D Generation(https://arxiv.org/abs/2412.01506)
Keywords: foundation model
Abstract: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.

Title: HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition

Authors: Anton Nuzhdin, Alexander Nagaev, Alexander Sautin, Alexander Kapitanov, Karina Kvanchiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01508
Pdf URL: https://arxiv.org/pdf/2412.01508
Copy Paste: [[2412.01508]] HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition(https://arxiv.org/abs/2412.01508)
Keywords: diffusion
Abstract: This paper proposes HaGRIDv2, the second version of the widely used hand gesture recognition dataset HaGRID. We cover 15 new gestures with conversation and control functions, including two-handed ones. Building on the foundational concepts proposed by HaGRID's authors, we implemented the dynamic gesture recognition algorithm and further enhanced it by adding three new groups of manipulation gestures. The "no gesture" class was diversified by adding samples of natural hand movements, which allowed us to reduce false positives by a factor of 6. Combining extra samples with HaGRID, the resulting version outperforms the original in pre-training models for gesture-related tasks. Besides, we achieved the best generalization ability among gesture and hand detection datasets. In addition, the second version enhances the quality of the gestures generated by the diffusion model. HaGRIDv2, pre-trained models, and a dynamic gesture recognition algorithm are publicly available.

Title: GFreeDet: Exploiting Gaussian Splatting and Foundation Models for Model-free Unseen Object Detection in the BOP Challenge 2024

Authors: Xingyu Liu, Yingyue Li, Chengxi Li, Gu Wang, Chenyangguang Zhang, Ziqin Huang, Xiangyang Ji
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01552
Pdf URL: https://arxiv.org/pdf/2412.01552
Copy Paste: [[2412.01552]] GFreeDet: Exploiting Gaussian Splatting and Foundation Models for Model-free Unseen Object Detection in the BOP Challenge 2024(https://arxiv.org/abs/2412.01552)
Keywords: foundation model
Abstract: In this report, we provide the technical details of the submitted method GFreeDet, which exploits Gaussian splatting and vision Foundation models for the model-free unseen object Detection track in the BOP 2024 Challenge.

Title: Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art

Authors: Sebastian Peitz, Sedjro Salomon Hotegni
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.01566
Pdf URL: https://arxiv.org/pdf/2412.01566
Copy Paste: [[2412.01566]] Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art(https://arxiv.org/abs/2412.01566)
Keywords: generative
Abstract: Simultaneously considering multiple objectives in machine learning has been a popular approach for several decades, with various benefits for multi-task learning, the consideration of secondary goals such as sparsity, or multicriteria hyperparameter tuning. However, because multi-objective optimization is significantly more costly than single-objective optimization, the recent focus on deep learning architectures poses considerable additional challenges due to the very large number of parameters, strong nonlinearities and stochasticity. This survey covers recent advancements in the area of multi-objective deep learning. We introduce a taxonomy of existing methods, based on the type of training algorithm as well as the decision maker's needs, before listing recent advancements and successful applications. All three main learning paradigms (supervised learning, reinforcement learning, and unsupervised learning) are covered, and we also address the recently very popular area of generative modeling.

Title: 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Authors: Ziyang Yan, Lei Li, Yihua Shao, Siyu Chen, Wuzong Kai, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01583
Pdf URL: https://arxiv.org/pdf/2412.01583
Copy Paste: [[2412.01583]] 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting(https://arxiv.org/abs/2412.01583)
Keywords: diffusion, generative
Abstract: The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive multi-step, 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper real-time performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits based on input prompts. The proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacing, and deletion directly on Gaussians. Extensive experimental results show that 3DSceneEditor achieves superior editing precision and speed with respect to current SOTA 3D scene editing approaches, establishing a new benchmark for efficient and interactive 3D scene customization.

Title: OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

Authors: Xuanyu Zhang, Zecheng Tang, Zhipei Xu, Runyi Li, Youmin Xu, Bin Chen, Feng Gao, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01615
Pdf URL: https://arxiv.org/pdf/2412.01615
Copy Paste: [[2412.01615]] OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking(https://arxiv.org/abs/2412.01615)
Keywords: generative
Abstract: With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.
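
The 4.25 dB figure quoted in this abstract is a PSNR gain. As a reminder of what that metric measures, here is a minimal pure-Python sketch of PSNR for 8-bit images. The function name `psnr` and the flat-list image representation are illustrative assumptions, not OmniGuard's evaluation code.

```python
import math

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a 4.25 dB gain at the same peak value corresponds to an MSE lower by a factor of about 10^0.425, roughly 2.66.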

Title: AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation

Authors: Xiaohu Liu, Sascha Hornauer, Fabien Moutarde, Jialiang Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01637
Pdf URL: https://arxiv.org/pdf/2412.01637
Copy Paste: [[2412.01637]] AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation(https://arxiv.org/abs/2412.01637)
Keywords: self-supervised
Abstract: Metric depth prediction from monocular videos suffers from poor generalization between datasets and requires supervised depth data for scale-correct training. Self-supervised training using multi-view reconstruction can benefit from large-scale natural videos but does not provide correct scale, limiting its benefits. Recently, reflecting audible Echoes off objects has been investigated for improved depth prediction and was shown to be sufficient to reconstruct objects at scale even without a visual signal. Because Echoes travel at fixed speed, they have the potential to resolve ambiguities in object scale and appearance. However, predicting depth end-to-end from sound and vision cannot benefit from unsupervised depth prediction approaches, which can process large-scale data without sound annotation. In this work we show how Echoes can benefit depth prediction in two ways: when learning metric depth from supervised data, and as a supervisory signal for scale-correct self-supervised training. We show how we can improve the predictions of several state-of-the-art approaches and how the method can scale-correct a self-supervised depth approach.

Title: Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

Authors: Varun Belagali, Srikar Yellapragada, Alexandros Graikos, Saarthak Kapse, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01672
Pdf URL: https://arxiv.org/pdf/2412.01672
Copy Paste: [[2412.01672]] Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning(https://arxiv.org/abs/2412.01672)
Keywords: diffusion, self-supervised, generative
Abstract: Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these `self-augmentations', i.e. generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.

Title: Diffusion Models with Anisotropic Gaussian Splatting for Image Inpainting

Authors: Jacob Fein-Ashley, Benjamin Fein-Ashley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01682
Pdf URL: https://arxiv.org/pdf/2412.01682
Copy Paste: [[2412.01682]] Diffusion Models with Anisotropic Gaussian Splatting for Image Inpainting(https://arxiv.org/abs/2412.01682)
Keywords: diffusion
Abstract: Image inpainting is a fundamental task in computer vision, aiming to restore missing or corrupted regions in images realistically. While recent deep learning approaches have significantly advanced the state-of-the-art, challenges remain in maintaining structural continuity and generating coherent textures, particularly in large missing areas. Diffusion models have shown promise in generating high-fidelity images but often lack the structural guidance necessary for realistic inpainting. We propose a novel inpainting method that combines diffusion models with anisotropic Gaussian splatting to capture both local structures and global context effectively. By modeling missing regions using anisotropic Gaussian functions that adapt to local image gradients, our approach provides structural guidance to the diffusion-based inpainting network. The Gaussian splat maps are integrated into the diffusion process, enhancing the model's ability to generate high-fidelity and structurally coherent inpainting results. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques, producing visually plausible results with enhanced structural integrity and texture realism.

Title: Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Authors: Zeyu Yang, Zijie Pan, Yuankun Yang, Xiatian Zhu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01717
Pdf URL: https://arxiv.org/pdf/2412.01717
Copy Paste: [[2412.01717]] Driving Scene Synthesis on Free-form Trajectories with Generative Prior(https://arxiv.org/abs/2412.01717)
Keywords: diffusion, generative
Abstract: Driving scene synthesis along free-form trajectories is essential for driving simulations to enable closed-loop evaluation of end-to-end driving policies. While existing methods excel at novel view synthesis on recorded trajectories, they face challenges with novel trajectories due to limited views of driving videos and the vastness of driving environments. To tackle this challenge, we propose a novel free-form driving view synthesis approach, dubbed DriveX, by leveraging video generative prior to optimize a 3D model across a variety of trajectories. Concretely, we crafted an inverse problem that enables a video diffusion model to be utilized as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). To seamlessly use the generative prior, we iteratively conduct this process during optimization. Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.

Title: LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Authors: Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01720
Pdf URL: https://arxiv.org/pdf/2412.01720
Copy Paste: [[2412.01720]] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant(https://arxiv.org/abs/2412.01720)
Keywords: generative
Abstract: With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominately rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.

Title: XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01762
Pdf URL: https://arxiv.org/pdf/2412.01762
Copy Paste: [[2412.01762]] XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation(https://arxiv.org/abs/2412.01762)
Keywords: generative
Abstract: Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers for the community to directly train the subsequent generative models on it or fine-tune for specialized tasks.
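
To make two of the quantizers named in this abstract concrete, the sketch below shows plain vector quantization (nearest codebook entry) and residual quantization (each stage codes what the previous stage left over). It is a toy illustration in pure Python. The function names and the nested-list codebooks are assumptions and do not reflect XQ-GAN's actual API.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def residual_quantize(vector, codebooks):
    """Quantize `vector` stage by stage; each stage codes the previous residual.

    Returns (indices, reconstruction, final_residual). With a single codebook
    this reduces to ordinary vector quantization.
    """
    residual = list(vector)
    recon = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        # Accumulate the chosen code and pass on what is still unexplained.
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon, residual
```

For example, with a coarse codebook `[[0.0, 0.0], [1.0, 1.0]]` followed by a fine codebook `[[0.0, 0.0], [0.25, -0.25]]`, the vector `[1.25, 0.75]` is coded by the second entry of each stage and reconstructed exactly.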

Title: Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions

Authors: Chaoran Cheng, Boran Han, Danielle C. Maddix, Abdul Fatir Ansari, Andrew Stuart, Michael W. Mahoney, Yuyang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.01786
Pdf URL: https://arxiv.org/pdf/2412.01786
Copy Paste: [[2412.01786]] Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions(https://arxiv.org/abs/2412.01786)
Keywords: generative
Abstract: Generative models that satisfy hard constraints are crucial in many scientific and engineering applications where physical laws or system requirements must be strictly respected. However, many existing constrained generative models, especially those developed for computer vision, rely heavily on gradient information, often sparse or computationally expensive in fields like partial differential equations (PDEs). In this work, we introduce a novel framework for adapting pre-trained, unconstrained flow-matching models to satisfy constraints exactly in a zero-shot manner without requiring expensive gradient computations or fine-tuning. Our framework, ECI sampling, alternates between extrapolation (E), correction (C), and interpolation (I) stages during each iterative sampling step of flow matching sampling to ensure accurate integration of constraint information while preserving the validity of the generation. We demonstrate the effectiveness of our approach across various PDE systems, showing that ECI-guided generation strictly adheres to physical constraints and accurately captures complex distribution shifts induced by these constraints. Empirical results demonstrate that our framework consistently outperforms baseline approaches in various zero-shot constrained generation tasks and also achieves competitive results in the regression tasks without additional fine-tuning.
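
The extrapolation, correction, and interpolation stages described in this abstract can be sketched on a toy problem. The code below is a loose reading of the abstract only, not the authors' algorithm: the velocity field, the constraint (pin the first coordinate to a fixed value), reuse of the initial noise in the interpolation step, and every name are assumptions made for illustration.

```python
import random

def eci_sample(velocity, correct, noise, num_steps=10):
    """Toy ECI-style sampler: extrapolate, correct, interpolate at each step."""
    x0 = list(noise)  # initial noise, reused in the interpolation stage (assumption)
    x = list(noise)
    for k in range(num_steps):
        t = k / num_steps
        t_next = (k + 1) / num_steps
        v = velocity(x, t)
        # E: one-step extrapolation to an estimate of the sample at t = 1.
        x1_hat = [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]
        # C: enforce the hard constraint on that estimate, gradient-free.
        x1_hat = correct(x1_hat)
        # I: move back onto the straight noise-to-data path at time t_next.
        x = [(1.0 - t_next) * n + t_next * s for n, s in zip(x0, x1_hat)]
    return x

# Hypothetical stand-ins for a pretrained flow model and a boundary constraint.
target = [0.0, 1.0, 2.0]

def toy_velocity(x, t):
    # Velocity of a straight-line flow that reaches `target` at t = 1.
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]

def pin_first(x1):
    # Hard constraint: the first coordinate must equal 5.0.
    return [5.0] + x1[1:]

rng = random.Random(0)
sample = eci_sample(toy_velocity, pin_first, [rng.gauss(0.0, 1.0) for _ in target])
```

In this sketch the last interpolation step uses `t_next = 1`, so the returned sample equals the corrected estimate and the constraint holds exactly, while the unconstrained coordinates follow the toy flow.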

Title: Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Authors: Rongkun Xue, Jinouwen Zhang, Yazhe Niu, Dazhong Shen, Bingqi Ma, Yu Liu, Jing Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01787
Pdf URL: https://arxiv.org/pdf/2412.01787
Copy Paste: [[2412.01787]] Pretrained Reversible Generation as Unsupervised Visual Representation Learning(https://arxiv.org/abs/2412.01787)
Keywords: generative
Abstract: Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous flow model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model-based methods, including 78% top-1 accuracy on ImageNet. Extensive ablation studies further validate the effectiveness of our approach.

Title: CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion

Authors: Kai He, Chin-Hsuan Wu, Igor Gilitschenski
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.01792
Pdf URL: https://arxiv.org/pdf/2412.01792
Copy Paste: [[2412.01792]] CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion(https://arxiv.org/abs/2412.01792)
Keywords: diffusion
Abstract: Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to "learn" the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.

Title: IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models

Authors: Khaled Abud, Sergey Lavrushkin, Alexey Kirillov, Dmitriy Vatolin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01794
Pdf URL: https://arxiv.org/pdf/2412.01794
Copy Paste: [[2412.01794]] IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models(https://arxiv.org/abs/2412.01794)
Keywords: diffusion, generative
Abstract: Diffusion-based models have recently transformed conditional image generation, achieving unprecedented fidelity in generating photorealistic and semantically accurate images. However, consistently generating high-quality images remains challenging, partly due to the lack of mechanisms for conditioning outputs on perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. First, we experiment with gradient-based guidance to optimize image quality directly and show this approach has limited generalizability. To address this, we introduce IQA-Adapter, a novel architecture that conditions generation on target quality levels by learning the relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter shifts the distribution of generated images towards a higher-quality subdomain. This approach achieves up to a 10% improvement across multiple objective metrics, as confirmed by a subjective study, while preserving generative diversity and content. Additionally, IQA-Adapter can be used inversely as a degradation model, generating progressively more distorted images when conditioned on lower quality scores. Our quality-aware methods also provide insights into the adversarial robustness of IQA models, underscoring the potential of quality conditioning in generative modeling and the importance of robust IQA methods.

Title: SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Authors: Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01801
Pdf URL: https://arxiv.org/pdf/2412.01801
Copy Paste: [[2412.01801]] SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation(https://arxiv.org/abs/2412.01801)
Keywords: diffusion
Abstract: We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes. This enables controllable editing of generated scenes by adding, removing, or resizing the semantic 3D proxy boxes, which in turn guide high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.

Title: COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Authors: Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01814
Pdf URL: https://arxiv.org/pdf/2412.01814
Copy Paste: [[2412.01814]] COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training(https://arxiv.org/abs/2412.01814)
Keywords: self-supervised
Abstract: Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

Title: Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Authors: Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01819
Pdf URL: https://arxiv.org/pdf/2412.01819
Copy Paste: [[2412.01819]] Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis(https://arxiv.org/abs/2412.01819)
Keywords: diffusion
Abstract: This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
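
The idea of disabling classifier-free guidance at high-resolution scales can be illustrated with the standard guidance formula, `uncond + w * (cond - uncond)`. The sketch below is a hedged illustration in pure Python. The function names, the `last_guided_scale` cutoff, and the callables standing in for model forward passes are all hypothetical, and the abstract does not say which scales Switti actually leaves unguided.

```python
def apply_cfg(cond, uncond, guidance_scale):
    """Standard classifier-free guidance applied element-wise to two predictions."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

def guided_prediction(scale_index, cond_fn, uncond_fn, guidance_scale, last_guided_scale):
    """Use guidance only up to `last_guided_scale`; later scales use `cond_fn` alone."""
    cond = cond_fn(scale_index)
    if scale_index > last_guided_scale:
        # Guidance disabled: the unconditional forward pass is skipped entirely.
        return cond
    return apply_cfg(cond, uncond_fn(scale_index), guidance_scale)
```

In this sketch, the speed-up comes from skipping the unconditional forward pass at the unguided scales.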

Title: Towards Universal Soccer Video Understanding

Authors: Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01820
Pdf URL: https://arxiv.org/pdf/2412.01820
Copy Paste: [[2412.01820]] Towards Universal Soccer Video Understanding(https://arxiv.org/abs/2412.01820)
Keywords: foundation model
Abstract: As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models and demonstrating the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.

Title: World-consistent Video Diffusion with Explicit 3D Modeling

Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01821
Pdf URL: https://arxiv.org/pdf/2412.01821
Copy Paste: [[2412.01821]] World-consistent Video Diffusion with Explicit 3D Modeling(https://arxiv.org/abs/2412.01821)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
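
An XYZ image, as described in this abstract, stores a global 3D coordinate at every pixel. One common way to obtain such an image is to back-project a depth map with pinhole intrinsics and a camera-to-world pose. The sketch below assumes that setup; the abstract does not state how WVD builds its XYZ frames, and the function name and argument layout are illustrative only.

```python
def depth_to_xyz(depth, fx, fy, cx, cy, rotation, translation):
    """Turn a depth map into an XYZ image of world coordinates.

    depth: H x W nested list of z-depths in the camera frame.
    fx, fy, cx, cy: pinhole focal lengths and principal point, in pixels.
    rotation (3x3) and translation (length 3) map camera coordinates to world.
    """
    xyz = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            # Back-project pixel (u, v) to a point in the camera frame.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Move the point into the shared world frame.
            world = tuple(
                sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            out_row.append(world)
        xyz.append(out_row)
    return xyz
```

Because every frame is expressed in the same world frame, pixels from different views that see the same surface point carry the same XYZ value, which is the kind of cross-view consistency signal the abstract refers to as explicit 3D supervision.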

Title: X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01824
Pdf URL: https://arxiv.org/pdf/2412.01824
Copy Paste: [[2412.01824]] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models(https://arxiv.org/abs/2412.01824)
Keywords: foundation model, in-context
Abstract: In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large-vision language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.