2024-12-16

Title: Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Authors: Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09645
Pdf URL: https://arxiv.org/pdf/2412.09645
Copy Paste: [[2412.09645]] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models(https://arxiv.org/abs/2412.09645)
Keywords: diffusion, generative
Abstract: Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.

Title: From Noise to Nuance: Advances in Deep Generative Image Models

Authors: Benji Peng, Chia Xin Liang, Ziqian Bi, Ming Liu, Yichao Zhang, Tianyang Wang, Keyu Chen, Xinyuan Song, Pohsun Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09656
Pdf URL: https://arxiv.org/pdf/2412.09656
Copy Paste: [[2412.09656]] From Noise to Nuance: Advances in Deep Generative Image Models(https://arxiv.org/abs/2412.09656)
Keywords: diffusion, generative
Abstract: Deep learning-based image generation has undergone a paradigm shift since 2021, marked by fundamental architectural breakthroughs and computational innovations. Through reviewing architectural innovations and empirical results, this paper analyzes the transition from traditional generative methods to advanced architectures, with focus on compute-efficient diffusion models and vision transformer architectures. We examine how recent developments in Stable Diffusion, DALL-E, and consistency models have redefined the capabilities and performance boundaries of image synthesis, while addressing persistent challenges in efficiency and quality. Our analysis focuses on the evolution of latent space representations, cross-attention mechanisms, and parameter-efficient training methodologies that enable accelerated inference under resource constraints. While more efficient training methods enable faster inference, advanced control mechanisms like ControlNet and regional attention systems have simultaneously improved generation precision and content customization. We investigate how enhanced multi-modal understanding and zero-shot generation capabilities are reshaping practical applications across industries. Our analysis demonstrates that despite remarkable advances in generation quality and computational efficiency, critical challenges remain in developing resource-conscious architectures and interpretable generation systems for industrial applications. The paper concludes by mapping promising research directions, including neural architecture optimization and explainable generation frameworks.

Title: Omni-ID: Holistic Identity Representation Designed for Generative Tasks

Authors: Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09694
Pdf URL: https://arxiv.org/pdf/2412.09694
Copy Paste: [[2412.09694]] Omni-ID: Holistic Identity Representation Designed for Generative Tasks(https://arxiv.org/abs/2412.09694)
Keywords: generative
Abstract: We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varied number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach uses a few-to-many identity reconstruction training paradigm, where a limited set of input images is used to reconstruct multiple target images of the same individual in various poses and expressions. A multi-decoder framework is further employed to leverage the complementary strengths of diverse decoders during training. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset -- a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.

Title: CUAL: Continual Uncertainty-aware Active Learner

Authors: Amanda Rios, Ibrahima Ndiour, Parual Datta, Jerry Sydir, Omesh Tickoo, Nilesh Ahuja
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09701
Pdf URL: https://arxiv.org/pdf/2412.09701
Copy Paste: [[2412.09701]] CUAL: Continual Uncertainty-aware Active Learner(https://arxiv.org/abs/2412.09701)
Keywords: foundation model
Abstract: AI deployed in many real-world use cases should be capable of adapting to novelties encountered after deployment. Here, we consider a challenging, under-explored and realistic continual adaptation problem: a deployed AI agent is continuously provided with unlabeled data that may contain not only unseen samples of known classes but also samples from novel (unknown) classes. In such a challenging setting, it has only a tiny labeling budget to query the most informative samples to help it continuously learn. We present a comprehensive solution to this complex problem with our model "CUAL" (Continual Uncertainty-aware Active Learner). CUAL leverages an uncertainty estimation algorithm to prioritize active labeling of ambiguous (uncertain) predicted novel class samples while also simultaneously pseudo-labeling the most certain predictions of each class. Evaluations across multiple datasets, ablations, settings and backbones (e.g. ViT foundation model) demonstrate our method's effectiveness. We will release our code upon acceptance.

Title: Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation

Authors: Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09706
Pdf URL: https://arxiv.org/pdf/2412.09706
Copy Paste: [[2412.09706]] Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation(https://arxiv.org/abs/2412.09706)
Keywords: diffusion, generative
Abstract: Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce IT3A, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text with the original test data. This process allows us to filter out some spurious augmentation and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modals, we have replaced prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on the test datasets with distribution shifts and domain gaps show that in a zero-shot setting, IT3A outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.

Title: The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications

Authors: Binxu Wang, John J. Vastola
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09726
Pdf URL: https://arxiv.org/pdf/2412.09726
Copy Paste: [[2412.09726]] The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications(https://arxiv.org/abs/2412.09726)
Keywords: diffusion
Abstract: By learning the gradient of smoothed data distributions, diffusion models can iteratively generate samples from complex distributions. The learned score function enables their generalization capabilities, but how the learned score relates to the score of the underlying data manifold remains largely unclear. Here, we aim to elucidate this relationship by comparing learned neural scores to the scores of two kinds of analytically tractable distributions: Gaussians and Gaussian mixtures. The simplicity of the Gaussian model makes it theoretically attractive, and we show that it admits a closed-form solution and predicts many qualitative aspects of sample generation dynamics. We claim that the learned neural score is dominated by its linear (Gaussian) approximation for moderate to high noise scales, and supply both theoretical and empirical arguments to support this claim. Moreover, the Gaussian approximation empirically works for a larger range of noise scales than naive theory suggests it should, and is preferentially learned early in training. At smaller noise scales, we observe that learned scores are better described by a coarse-grained (Gaussian mixture) approximation of training data than by the score of the training distribution, a finding consistent with generalization. Our findings enable us to precisely predict the initial phase of trained models' sampling trajectories through their Gaussian approximations. We show that this allows the skipping of the first 15-30% of sampling steps while maintaining high sample quality (with a near state-of-the-art FID score of 1.93 on CIFAR-10 unconditional generation). This forms the foundation of a novel hybrid sampling method, termed analytical teleportation, which can seamlessly integrate with and accelerate existing samplers, including DPM-Solver-v3 and UniPC. Our findings suggest ways to improve the design and training of diffusion models.

Title: Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models

Authors: Faith Johnson, Ryan Meegan, Jack Lowry, Peter Oudemans, Kristin Dana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09739
Pdf URL: https://arxiv.org/pdf/2412.09739
Copy Paste: [[2412.09739]] Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models(https://arxiv.org/abs/2412.09739)
Keywords: foundation model
Abstract: Agricultural domains are being transformed by recent advances in AI and computer vision that support quantitative visual evaluation. Using aerial and ground imaging over a time series, we develop a framework for characterizing the ripening process of cranberry crops, a crucial component for precision agriculture tasks such as comparing crop breeds (high-throughput phenotyping) and detecting disease. Using drone imaging, we capture images from 20 waypoints across multiple bogs, and using ground-based imaging (hand-held camera), we image same bog patch using fixed fiducial markers. Both imaging methods are repeated to gather a multi-week time series spanning the entire growing season. Aerial imaging provides multiple samples to compute a distribution of albedo values. Ground imaging enables tracking of individual berries for a detailed view of berry appearance changes. Using vision transformers (ViT) for feature detection after segmentation, we extract a high dimensional feature descriptor of berry appearance. Interpretability of appearance is critical for plant biologists and cranberry growers to support crop breeding decisions (e.g.\ comparison of berry varieties from breeding programs). For interpretability, we create a 2D manifold of cranberry appearance by using a UMAP dimensionality reduction on ViT features. This projection enables quantification of ripening paths and a useful metric of ripening rate. We demonstrate the comparison of four cranberry varieties based on our ripening assessments. This work is the first of its kind and has future impact for cranberries and for other crops including wine grapes, olives, blueberries, and maize. Aerial and ground datasets are made publicly available.

Title: Toward Foundation Model for Multivariate Wearable Sensing of Physiological Signals

Authors: Yunfei Luo, Yuliang Chen, Asif Salekin, Tauhidur Rahman
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2412.09758
Pdf URL: https://arxiv.org/pdf/2412.09758
Copy Paste: [[2412.09758]] Toward Foundation Model for Multivariate Wearable Sensing of Physiological Signals(https://arxiv.org/abs/2412.09758)
Keywords: foundation model
Abstract: Time-series foundation models have the ability to run inference, mainly forecasting, on any type of time series data, thanks to the informative representations comprising waveform features. Wearable sensing data, on the other hand, contain more variability in both patterns and frequency bands of interest and generally emphasize more on the ability to infer healthcare-related outcomes. The main challenge of crafting a foundation model for wearable sensing physiological signals is to learn generalizable representations that support efficient adaptation across heterogeneous sensing configurations and applications. In this work, we propose NormWear, a step toward such a foundation model, aiming to extract generalized and informative wearable sensing representations. NormWear has been pretrained on a large set of physiological signals, including PPG, ECG, EEG, GSR, and IMU, from various public resources. For a holistic assessment, we perform downstream evaluation on 11 public wearable sensing datasets, spanning 18 applications in the areas of mental health, body state inference, biomarker estimations, and disease risk evaluations. We demonstrate that NormWear achieves a better performance improvement over competitive baselines in general time series foundation modeling. In addition, leveraging a novel representation-alignment-match-based method, we align physiological signals embeddings with text embeddings. This alignment enables our proposed foundation model to perform zero-shot inference, allowing it to generalize to previously unseen wearable signal-based health applications. Finally, we perform nonlinear dynamic analysis on the waveform features extracted by the model at each intermediate layer. This analysis quantifies the model's internal processes, offering clear insights into its behavior and fostering greater trust in its inferences among end users.

Title: CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Authors: Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09799
Pdf URL: https://arxiv.org/pdf/2412.09799
Copy Paste: [[2412.09799]] CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection(https://arxiv.org/abs/2412.09799)
Keywords: foundation model
Abstract: Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.

Title: FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks

Authors: Ahmadreza Eslaminia, Adrian Jackson, Beitong Tian, Avi Stern, Hallie Gordon, Rajiv Malhotra, Klara Nahrstedt, Chenhui Shao
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2412.09819
Pdf URL: https://arxiv.org/pdf/2412.09819
Copy Paste: [[2412.09819]] FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks(https://arxiv.org/abs/2412.09819)
Keywords: anomaly
Abstract: Fused Deposition Modeling (FDM) is a widely used additive manufacturing (AM) technique valued for its flexibility and cost-efficiency, with applications in a variety of industries including healthcare and aerospace. Recent developments have made affordable FDM machines accessible and encouraged adoption among diverse users. However, the design, planning, and production process in FDM require specialized interdisciplinary knowledge. Managing the complex parameters and resolving print defects in FDM remain challenging. These technical complexities form the most critical barrier preventing individuals without technical backgrounds and even professional engineers without training in other domains from participating in AM design and manufacturing. Large Language Models (LLMs), with their advanced capabilities in text and code processing, offer the potential for addressing these challenges in FDM. However, existing research on LLM applications in this field is limited, typically focusing on specific use cases without providing comprehensive evaluations across multiple models and tasks. To this end, we introduce FDM-Bench, a benchmark dataset designed to evaluate LLMs on FDM-specific tasks. FDM-Bench enables a thorough assessment by including user queries across various experience levels and G-code samples that represent a range of anomalies. We evaluate two closed-source models (GPT-4o and Claude 3.5 Sonnet) and two open-source models (Llama-3.1-70B and Llama-3.1-405B) on FDM-Bench. A panel of FDM experts assess the models' responses to user queries in detail. Results indicate that closed-source models generally outperform open-source models in G-code anomaly detection, whereas Llama-3.1-405B demonstrates a slight advantage over other models in responding to user queries. These findings underscore FDM-Bench's potential as a foundational tool for advancing research on LLM capabilities in FDM.

Title: Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism

Authors: Jun Zheng, Jing Wang, Fuwei Zhao, Xujie Zhang, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09822
Pdf URL: https://arxiv.org/pdf/2412.09822
Copy Paste: [[2412.09822]] Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism(https://arxiv.org/abs/2412.09822)
Keywords: diffusion
Abstract: Video try-on stands as a promising area for its tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, while performing poorly with complex movements. To better preserve clothing details, those approaches are armed with an additional garment encoder, resulting in higher computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder's capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on Diffusion Transformer(DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach by utilizing the DiT backbone itself as the garment encoder and employing a dynamic feature fusion module to store and integrate garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that enforces the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.

Title: MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

Authors: Xunnong Xu, Mengying Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09828
Pdf URL: https://arxiv.org/pdf/2412.09828
Copy Paste: [[2412.09828]] MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion(https://arxiv.org/abs/2412.09828)
Keywords: diffusion, generative
Abstract: Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates on different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.

Title: Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training

Authors: Yujin Choi, Jinseong Park, Junyoung Byun, Jaewook Lee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09842
Pdf URL: https://arxiv.org/pdf/2412.09842
Copy Paste: [[2412.09842]] Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training(https://arxiv.org/abs/2412.09842)
Keywords: diffusion, generative
Abstract: Programmatically generated synthetic data has been used in differential private training for classification to enhance performance without privacy leakage. However, as the synthetic data is generated from a random process, the distribution of real data and the synthetic data are distinguishable and difficult to transfer. Therefore, the model trained with the synthetic data generates unrealistic random images, raising challenges to adapt the synthetic data for generative models. In this work, we propose DP-SynGen, which leverages programmatically generated synthetic data in diffusion models to address this challenge. By exploiting the three stages of diffusion models(coarse, context, and cleaning) we identify stages where synthetic data can be effectively utilized. We theoretically and empirically verified that cleaning and coarse stages can be trained without private data, replacing them with synthetic data to reduce the privacy budget. The experimental results show that DP-SynGen improves the quality of generative data by mitigating the negative impact of privacy-induced noise on the generation process.

Title: Learning Structural Causal Models from Ordering: Identifiable Flow Models

Authors: Minh Khoa Le, Kien Do, Truyen Tran
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.09843
Pdf URL: https://arxiv.org/pdf/2412.09843
Copy Paste: [[2412.09843]] Learning Structural Causal Models from Ordering: Identifiable Flow Models(https://arxiv.org/abs/2412.09843)
Keywords: diffusion
Abstract: In this study, we address causal inference when only observational data and a valid causal ordering from the causal graph are available. We introduce a set of flow models that can recover component-wise, invertible transformation of exogenous variables. Our flow-based methods offer flexible model design while maintaining causal consistency regardless of the number of discretization steps. We propose design improvements that enable simultaneous learning of all causal mechanisms and reduce abduction and prediction complexity to linear O(n) relative to the number of layers, independent of the number of causal variables. Empirically, we demonstrate that our method outperforms previous state-of-the-art approaches and delivers consistent performance across a wide range of structural causal models in answering observational, interventional, and counterfactual questions. Additionally, our method achieves a significant reduction in computational time compared to existing diffusion-based techniques, making it practical for large structural causal models.

Title: Real-time Identity Defenses against Malicious Personalization of Diffusion Models

Authors: Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, Chongxuan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09844
Pdf URL: https://arxiv.org/pdf/2412.09844
Copy Paste: [[2412.09844]] Real-time Identity Defenses against Malicious Personalization of Diffusion Models(https://arxiv.org/abs/2412.09844)
Keywords: diffusion
Abstract: Personalized diffusion models, capable of synthesizing highly realistic images based on a few reference portraits, pose substantial social, ethical, and legal risks by enabling identity replication. Existing defense mechanisms rely on computationally intensive adversarial perturbations tailored to individual images, rendering them impractical for real-world deployment. This study introduces Real-time Identity Defender (RID), a neural network designed to generate adversarial perturbations through a single forward pass, bypassing the need for image-specific optimization. RID achieves unprecedented efficiency, with defense times as low as 0.12 seconds on a single GPU (4,400 times faster than leading methods) and 1.1 seconds per image on a standard Intel i9 CPU, making it suitable for edge devices such as smartphones. Despite its efficiency, RID matches state-of-the-art performance across visual and quantitative benchmarks, effectively mitigating identity replication risks. Our analysis reveals that RID's perturbations mimic the efficacy of traditional defenses while exhibiting properties distinct from natural noise, such as Gaussian perturbations. To enhance robustness, we extend RID into an ensemble framework that integrates multiple pre-trained text-to-image diffusion models, ensuring resilience against black-box attacks and post-processing techniques, including JPEG compression and diffusion-based purification.

Title: LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Authors: Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09856
Pdf URL: https://arxiv.org/pdf/2412.09856
Copy Paste: [[2412.09856]] LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity(https://arxiv.org/abs/2412.09856)
Keywords: diffusion
Abstract: Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15$\times$ (11.5$\times$) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our project website: this https URL.

Title: T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements

Authors: Xiangxi Tian, Jie Shan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09886
Pdf URL: https://arxiv.org/pdf/2412.09886
Copy Paste: [[2412.09886]] T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements(https://arxiv.org/abs/2412.09886)
Keywords: generative
Abstract: Generating continuous environmental models from sparsely sampled data is a critical challenge in spatial modeling, particularly for topography. Traditional spatial interpolation methods often struggle with handling sparse measurements. To address this, we propose a Transformer-based Generative Model for Spatial Interpolation (T-GMSI) using a vision transformer (ViT) architecture for digital elevation model (DEM) generation under sparse conditions. T-GMSI replaces traditional convolution-based methods with ViT for feature extraction and DEM interpolation while incorporating a terrain feature-aware loss function for enhanced accuracy. T-GMSI excels in producing high-quality elevation surfaces from datasets with over 70% sparsity and demonstrates strong transferability across diverse landscapes without fine-tuning. Its performance is validated through extensive experiments, outperforming traditional methods such as ordinary Kriging (OK) and natural neighbor (NN) and a conditional generative adversarial network (CGAN)-based model (CEDGAN). Compared to OK and NN, T-GMSI reduces root mean square error (RMSE) by 40% and 25% on airborne lidar data and by 23% and 10% on spaceborne lidar data. Against CEDGAN, T-GMSI achieves a 20% RMSE improvement on provided DEM data, requiring no fine-tuning. The ability of model on generalizing to large, unseen terrains underscores its transferability and potential applicability beyond topographic modeling. This research establishes T-GMSI as a state-of-the-art solution for spatial interpolation on sparse datasets and highlights its broader utility for other sparse data interpolation challenges.

Title: IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs

Authors: Sosuke Yamao, Natsuki Miyahara, Yuki Harazono, Shun Takeuchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09907
Pdf URL: https://arxiv.org/pdf/2412.09907
Copy Paste: [[2412.09907]] IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs(https://arxiv.org/abs/2412.09907)
Keywords: in-context
Abstract: With the increasing complexity of video data and the need for more efficient long-term temporal understanding, existing long-term video understanding methods often fail to accurately capture and analyze extended video sequences. These methods typically struggle to maintain performance over longer durations and to handle the intricate dependencies within the video content. To address these limitations, we propose a simple yet effective large multi-modal model framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC). The key idea, inspired by humans' selective attention and in-context memory mechanisms, is to introduce a novel visual compressor and incorporate efficient memory management techniques to enhance long-term video question answering. Our framework utilizes IQViC, a transformer-based visual compressor, enabling question-conditioned in-context compression, unlike existing methods that rely on full video visual features. This selectively extracts relevant information, significantly reducing memory token requirements. Through extensive experiments on a new dataset based on InfiniBench for long-term video understanding, and standard benchmarks used for existing methods' evaluation, we demonstrate the effectiveness of our proposed IQViC framework and its superiority over state-of-the-art methods in terms of video understanding accuracy and memory efficiency.

Title: Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

Authors: Yasamin Medghalchi, Moein Heidari, Clayton Allard, Leonid Sigal, Ilker Hacihaliloglu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09910
Pdf URL: https://arxiv.org/pdf/2412.09910
Copy Paste: [[2412.09910]] Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images(https://arxiv.org/abs/2412.09910)
Keywords: diffusion
Abstract: Deep neural networks (DNNs) offer significant promise for improving breast cancer diagnosis in medical imaging. However, these models are highly susceptible to adversarial attacks--small, imperceptible changes that can mislead classifiers--raising critical concerns about their reliability and security. Traditional attacks rely on fixed-norm perturbations, misaligning with human perception. In contrast, diffusion-based attacks require pre-trained models, demanding substantial data when these models are unavailable, limiting practical use in data-scarce scenarios. In medical imaging, however, this is often unfeasible due to the limited availability of datasets. Building on recent advancements in learnable prompts, we propose Prompt2Perturb (P2P), a novel language-guided attack method capable of generating meaningful attack examples driven by text instructions. During the prompt learning phase, our approach leverages learnable prompts within the text encoder to create subtle, yet impactful, perturbations that remain imperceptible while guiding the model towards targeted outcomes. In contrast to current prompt learning-based approaches, our P2P stands out by directly updating text embeddings, avoiding the need for retraining diffusion models. Further, we leverage the finding that optimizing only the early reverse diffusion steps boosts efficiency while ensuring that the generated adversarial examples incorporate subtle noise, thus preserving ultrasound image quality without introducing noticeable artifacts. We show that our method outperforms state-of-the-art attack techniques across three breast ultrasound datasets in FID and LPIPS. Moreover, the generated images are both more natural in appearance and more effective compared to existing adversarial attacks. Our code will be publicly available this https URL.

Title: All-in-One: Transferring Vision Foundation Models into Stereo Matching

Authors: Jingyi Zhou, Haoyu Zhang, Jiakang Yuan, Peng Ye, Tao Chen, Hao Jiang, Meiya Chen, Yangyang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09912
Pdf URL: https://arxiv.org/pdf/2412.09912
Copy Paste: [[2412.09912]] All-in-One: Transferring Vision Foundation Models into Stereo Matching(https://arxiv.org/abs/2412.09912)
Keywords: foundation model
Abstract: As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we proposed a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on the mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves start-of-the-art performance on multiple datasets and ranks $1^{st}$ on the Middlebury dataset and outperforms all the published work on the ETH3D benchmark.

Title: FaceShield: Defending Facial Image against Deepfake Threats

Authors: Jaehwan Jeong, Sumin In, Sieun Kim, Hannie Shin, Jongheon Jeong, Sang Ho Yoon, Jaewook Chung, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09921
Pdf URL: https://arxiv.org/pdf/2412.09921
Copy Paste: [[2412.09921]] FaceShield: Defending Facial Image against Deepfake Threats(https://arxiv.org/abs/2412.09921)
Keywords: diffusion, generative
Abstract: The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity verification is not critical. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel attack strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates attacks on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG distortion. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting applicability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.

Title: Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization

Authors: Xinhao Zhong, Shuoyang Sun, Xulin Gu, Zhaoyang Xu, Yaowei Wang, Jianlong Wu, Bin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09959
Pdf URL: https://arxiv.org/pdf/2412.09959
Copy Paste: [[2412.09959]] Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization(https://arxiv.org/abs/2412.09959)
Keywords: diffusion
Abstract: Dataset distillation offers an efficient way to reduce memory and computational costs by optimizing a smaller dataset with performance comparable to the full-scale original. However, for large datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101), the extensive optimization space limits performance, reducing its practicality. Recent approaches employ pre-trained diffusion models to generate informative images directly, avoiding pixel-level optimization and achieving notable results. However, these methods often face challenges due to distribution shifts between pre-trained models and target datasets, along with the need for multiple distillation steps across varying settings. To address these issues, we propose a novel framework orthogonal to existing diffusion-based distillation methods, leveraging diffusion models for selection rather than generation. Our method starts by predicting noise generated by the diffusion model based on input images and text prompts (with or without label text), then calculates the corresponding loss for each pair. With the loss differences, we identify distinctive regions of the original images. Additionally, we perform intra-class clustering and ranking on selected patches to maintain diversity constraints. This streamlined framework enables a single-step distillation process, and extensive experiments demonstrate that our approach outperforms state-of-the-art methods across various metrics.

Title: EP-CFG: Energy-Preserving Classifier-Free Guidance

Authors: Kai Zhang, Fujun Luan, Sai Bi, Jianming Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09966
Pdf URL: https://arxiv.org/pdf/2412.09966
Copy Paste: [[2412.09966]] EP-CFG: Energy-Preserving Classifier-Free Guidance(https://arxiv.org/abs/2412.09966)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) is widely used in diffusion models but often introduces over-contrast and over-saturation artifacts at higher guidance strengths. We present EP-CFG (Energy-Preserving Classifier-Free Guidance), which addresses these issues by preserving the energy distribution of the conditional prediction during the guidance process. Our method simply rescales the energy of the guided output to match that of the conditional prediction at each denoising step, with an optional robust variant for improved artifact suppression. Through experiments, we show that EP-CFG maintains natural image quality and preserves details across guidance strengths while retaining CFG's semantic alignment benefits, all with minimal computational overhead.

Title: Object-Focused Data Selection for Dense Prediction Tasks

Authors: Niclas Popp, Dan Zhang, Jan Hendrik Metzen, Matthias Hein, Lukas Schott
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10032
Pdf URL: https://arxiv.org/pdf/2412.10032
Copy Paste: [[2412.10032]] Object-Focused Data Selection for Dense Prediction Tasks(https://arxiv.org/abs/2412.10032)
Keywords: foundation model
Abstract: Dense prediction tasks such as object detection and segmentation require high-quality labels at pixel level, which are costly to obtain. Recent advances in foundation models have enabled the generation of autolabels, which we find to be competitive but not yet sufficient to fully replace human annotations, especially for more complex datasets. Thus, we consider the challenge of selecting a representative subset of images for labeling from a large pool of unlabeled images under a constrained annotation budget. This task is further complicated by imbalanced class distributions, as rare classes are often underrepresented in selected subsets. We propose object-focused data selection (OFDS) which leverages object-level representations to ensure that the selected image subsets semantically cover the target classes, including rare ones. We validate OFDS on PASCAL VOC and Cityscapes for object detection and semantic segmentation tasks. Our experiments demonstrate that prior methods which employ image-level representations fail to consistently outperform random selection. In contrast, OFDS consistently achieves state-of-the-art performance with substantial improvements over all baselines in scenarios with imbalanced class distributions. Moreover, we demonstrate that pre-training with autolabels on the full datasets before fine-tuning on human-labeled subsets selected by OFDS further enhances the final performance.

Title: SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution

Authors: Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10049
Pdf URL: https://arxiv.org/pdf/2412.10049
Copy Paste: [[2412.10049]] SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution(https://arxiv.org/abs/2412.10049)
Keywords: diffusion
Abstract: In today's digital landscape, the blending of AI-generated and authentic content has underscored the need for copyright protection and content authentication. Watermarking has become a vital tool to address these challenges, safeguarding both generated and real content. Effective watermarking methods must withstand various distortions and attacks. Current deep watermarking techniques often use an encoder-noise layer-decoder architecture and include distortions to enhance robustness. However, they struggle to balance robustness and fidelity and remain vulnerable to adaptive attacks, despite extensive training. To overcome these limitations, we propose SuperMark, a robust, training-free watermarking framework. Inspired by the parallels between watermark embedding/extraction in watermarking and the denoising/noising processes in diffusion models, SuperMark embeds the watermark into initial Gaussian noise using existing techniques. It then applies pre-trained Super-Resolution (SR) models to denoise the watermarked noise, producing the final watermarked image. For extraction, the process is reversed: the watermarked image is inverted back to the initial watermarked noise via DDIM Inversion, from which the embedded watermark is extracted. This flexible framework supports various noise injection methods and diffusion-based SR models, enabling enhanced customization. The robustness of the DDIM Inversion process against perturbations allows SuperMark to achieve strong resilience to distortions while maintaining high fidelity. Experiments demonstrate that SuperMark achieves fidelity comparable to existing methods while significantly improving robustness. Under standard distortions, it achieves an average watermark extraction accuracy of 99.46%, and 89.29% under adaptive attacks. Moreover, SuperMark shows strong transferability across datasets, SR models, embedding methods, and resolutions.

Title: Quaffure: Real-Time Quasi-Static Neural Hair Simulation

Authors: Tuur Stuyck, Gene Wei-Chin Lin, Egor Larionov, Hsiao-yu Chen, Aljaz Bozic, Nikolaos Sarafianos, Doug Roble
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10061
Pdf URL: https://arxiv.org/pdf/2412.10061
Copy Paste: [[2412.10061]] Quaffure: Real-Time Quasi-Static Neural Hair Simulation(https://arxiv.org/abs/2412.10061)
Keywords: self-supervised
Abstract: Realistic hair motion is crucial for high-quality avatars, but it is often limited by the computational resources available for real-time applications. To address this challenge, we propose a novel neural approach to predict physically plausible hair deformations that generalizes to various body poses, shapes, and hairstyles. Our model is trained using a self-supervised loss, eliminating the need for expensive data generation and storage. We demonstrate our method's effectiveness through numerous results across a wide range of pose and shape variations, showcasing its robust generalization capabilities and temporally smooth results. Our approach is highly suitable for real-time applications with an inference time of only a few milliseconds on consumer hardware and its ability to scale to predicting the drape of 1000 grooms in 0.3 seconds.

Title: RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector

Authors: Zhensheng Wang, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10104
Pdf URL: https://arxiv.org/pdf/2412.10104
Copy Paste: [[2412.10104]] RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector(https://arxiv.org/abs/2412.10104)
Keywords: in-context
Abstract: The real estate market relies heavily on structured data, such as property details, market trends, and price fluctuations. However, the lack of specialized Tabular Question Answering datasets in this domain limits the development of automated question-answering systems. To fill this gap, we introduce RETQA, the first large-scale open-domain Chinese Tabular Question Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762 question-answer pairs across 16 sub-fields within three major domains: property information, real estate company finance information and land auction information. Compared with existing tabular question answering datasets, RETQA poses greater challenges due to three key factors: long-table structures, open-domain retrieval, and multi-domain queries. To tackle these challenges, we propose the SLUTQA framework, which integrates large language models with spoken language understanding tasks to enhance retrieval and answering accuracy. Extensive experiments demonstrate that SLUTQA significantly improves the performance of large language models on RETQA by in-context learning. RETQA and SLUTQA provide essential resources for advancing tabular question answering research in the real estate domain, addressing critical challenges in open-domain and long-table question-answering. The dataset and code are publicly available at \url{this https URL}.

Title: Filter or Compensate: Towards Invariant Representation from Distribution Shift for Anomaly Detection

Authors: Zining Chen, Xingshuang Luo, Weiqiu Wang, Zhicheng Zhao, Fei Su, Aidong Men
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10115
Pdf URL: https://arxiv.org/pdf/2412.10115
Copy Paste: [[2412.10115]] Filter or Compensate: Towards Invariant Representation from Distribution Shift for Anomaly Detection(https://arxiv.org/abs/2412.10115)
Keywords: anomaly
Abstract: Recent Anomaly Detection (AD) methods have achieved great success with In-Distribution (ID) data. However, real-world data often exhibits distribution shift, causing huge performance decay on traditional AD methods. From this perspective, few previous work has explored AD with distribution shift, and the distribution-invariant normality learning has been proposed based on the Reverse Distillation (RD) framework. However, we observe the misalignment issue between the teacher and the student network that causes detection failure, thereby propose FiCo, Filter or Compensate, to address the distribution shift issue in AD. FiCo firstly compensates the distribution-specific information to reduce the misalignment between the teacher and student network via the Distribution-Specific Compensation (DiSCo) module, and secondly filters all abnormal information to capture distribution-invariant normality with the Distribution-Invariant Filter (DiIFi) module. Extensive experiments on three different AD benchmarks demonstrate the effectiveness of FiCo, which outperforms all existing state-of-the-art (SOTA) methods, and even achieves better results on the ID scenario compared with RD-based methods. Our code is available at this https URL.

Title: The Art of Deception: Color Visual Illusions and Diffusion Models

Authors: Alex Gomez-Villa, Kai Wang, Alejandro C. Parraga, Bartlomiej Twardowski, Jesus Malo, Javier Vazquez-Corral, Joost van de Weijer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10122
Pdf URL: https://arxiv.org/pdf/2412.10122
Copy Paste: [[2412.10122]] The Art of Deception: Color Visual Illusions and Diffusion Models(https://arxiv.org/abs/2412.10122)
Keywords: diffusion
Abstract: Visual illusions in humans arise when interpreting out-of-distribution stimuli: if the observer is adapted to certain statistics, perception of outliers deviates from reality. Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions. This revelation raises profound questions about the nature of visual information. Why are two independent systems, both human brains and ANNs, susceptible to the same illusions? Should any ANN be capable of perceiving visual illusions? Are these perceptions a feature or a flaw? In this work, we study how visual illusions are encoded in diffusion models. Remarkably, we show that they present human-like brightness/color shifts in their latent space. We use this fact to demonstrate that diffusion models can predict visual illusions. Furthermore, we also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models. We validate this ability through psychophysical experiments that show how our model-generated illusions also fool humans.

Title: Feature Selection for Latent Factor Models

Authors: Rittwika Kansabanik, Adrian Barbu
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2412.10128
Pdf URL: https://arxiv.org/pdf/2412.10128
Copy Paste: [[2412.10128]] Feature Selection for Latent Factor Models(https://arxiv.org/abs/2412.10128)
Keywords: generative
Abstract: Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach has theoretical true feature recovery guarantees under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.

Title: ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10138
Pdf URL: https://arxiv.org/pdf/2412.10138
Copy Paste: [[2412.10138]] ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL(https://arxiv.org/abs/2412.10138)
Keywords: in-context
Abstract: Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model's understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

Title: SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

Authors: Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10178
Pdf URL: https://arxiv.org/pdf/2412.10178
Copy Paste: [[2412.10178]] SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models(https://arxiv.org/abs/2412.10178)
Keywords: diffusion
Abstract: Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. While significant advances have been made in image-based virtual try-ons, extending these successes to video often results in frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequence. To address these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we introduce ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the \dataname~dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments show that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. Data and code are available at this https URL

Title: Simple Guidance Mechanisms for Discrete Diffusion Models

Authors: Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, Volodymyr Kuleshov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10193
Pdf URL: https://arxiv.org/pdf/2412.10193
Copy Paste: [[2412.10193]] Simple Guidance Mechanisms for Discrete Diffusion Models(https://arxiv.org/abs/2412.10193)
Keywords: diffusion
Abstract: Diffusion models for continuous data gained widespread adoption owing to their high quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.

Title: Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10208
Pdf URL: https://arxiv.org/pdf/2412.10208
Copy Paste: [[2412.10208]] Efficient Generative Modeling with Residual Vector Quantization-Based Tokens(https://arxiv.org/abs/2412.10208)
Keywords: diffusion, generative
Abstract: We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation} on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at this https URL

Title: GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion

Authors: Jiapeng Tang, Davide Davoli, Tobias Kirschstein, Liam Schoneveld, Matthias Niessner
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10209
Pdf URL: https://arxiv.org/pdf/2412.10209
Copy Paste: [[2412.10209]] GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion(https://arxiv.org/abs/2412.10209)
Keywords: diffusion
Abstract: We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms the previous state-of-the-art methods in novel view synthesis by a 5.34\% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.

Title: Learning Complex Non-Rigid Image Edits from Multimodal Conditioning

Authors: Nikolai Warner, Jack Kolb, Meera Hahn, Vighnesh Birodkar, Jonathan Huang, Irfan Essa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10219
Pdf URL: https://arxiv.org/pdf/2412.10219
Copy Paste: [[2412.10219]] Learning Complex Non-Rigid Image Edits from Multimodal Conditioning(https://arxiv.org/abs/2412.10219)
Keywords: diffusion
Abstract: In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural looking images while being highly controllable with text and pose. To accomplish this we need to train on pairs of images, the first a reference image with the person, the second a "target image" showing the same person (with a different pose and possibly in a different background). Additionally we require a text caption describing the new pose relative to that in the reference image. In this paper we present a novel dataset following this criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially scenes where there is an interaction between persons and objects. Combining the weak supervision from noisy captions, with robust 2D pose improves the quality of person-object interactions.

Title: Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication

Authors: Alireza Furutanpey, Pantelis A. Frangoudis, Patrik Szabo, Schahram Dustdar
Subjects: cs.LG, cs.DC, cs.NI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.10265
Pdf URL: https://arxiv.org/pdf/2412.10265
Copy Paste: [[2412.10265]] Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication(https://arxiv.org/abs/2412.10265)
Keywords: generative
Abstract: This paper investigates the adversarial robustness of Deep Neural Networks (DNNs) using Information Bottleneck (IB) objectives for task-oriented communication systems. We empirically demonstrate that while IB-based approaches provide baseline resilience against attacks targeting downstream tasks, the reliance on generative models for task-oriented communication introduces new vulnerabilities. Through extensive experiments on several datasets, we analyze how bottleneck depth and task complexity influence adversarial robustness. Our key findings show that Shallow Variational Bottleneck Injection (SVBI) provides less adversarial robustness compared to Deep Variational Information Bottleneck (DVIB) approaches, with the gap widening for more complex tasks. Additionally, we reveal that IB-based objectives exhibit stronger robustness against attacks focusing on salient pixels with high intensity compared to those perturbing many pixels with lower intensity. Lastly, we demonstrate that task-oriented communication systems that rely on generative models to extract and recover salient information have an increased attack surface. The results highlight important security considerations for next-generation communication systems that leverage neural networks for goal-oriented compression.

Title: Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media

Authors: Jiaqing Yuan, Ruijie Xi, Munindar P. Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10266
Pdf URL: https://arxiv.org/pdf/2412.10266
Copy Paste: [[2412.10266]] Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media(https://arxiv.org/abs/2412.10266)
Keywords: generative
Abstract: Stance detection is crucial for fostering a human-centric Web by analyzing user-generated content to identify biases and harmful narratives that undermine trust. With the development of Large Language Models (LLMs), existing approaches treat stance detection as a classification problem, providing robust methodologies for modeling complex group interactions and advancing capabilities in natural language tasks. However, these methods often lack interpretability, limiting their ability to offer transparent and understandable justifications for predictions. This study adopts a generative approach, where stance predictions include explicit, interpretable rationales, and integrates them into smaller language models through single-task and multitask learning. We find that incorporating reasoning into stance detection enables the smaller model (FlanT5) to outperform GPT-3.5's zero-shot performance, achieving an improvement of up to 9.57%. Moreover, our results show that reasoning capabilities enhance multitask learning performance but may reduce effectiveness in single-task settings. Crucially, we demonstrate that faithful rationales improve rationale distillation into SLMs, advancing efforts to build interpretable, trustworthy systems for addressing discrimination, fostering trust, and promoting equitable engagement on social media.

Title: Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry

Authors: Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10273
Pdf URL: https://arxiv.org/pdf/2412.10273
Copy Paste: [[2412.10273]] Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry(https://arxiv.org/abs/2412.10273)
Keywords: diffusion
Abstract: We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" models the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation in a multiview image format to coordinate the generation of multiple target views simultaneously. We facilitate correspondence between views by assuming fixed target camera poses relative to the source camera, and constructing a predictable distribution of geometric features per target. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats SoTA baselines such as CAT3D and One-2-3-45 on held-out objects from ObjaverseXL, as well as real-world objects ranging from Google Scanned Objects, Amazon Berkeley Objects, to the Digital Twin Catalog.

Title: TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Authors: Xingrui Wang, Xin Li, Yaosi Hu, Hanxin Zhu, Chen Hou, Cuiling Lan, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10275
Pdf URL: https://arxiv.org/pdf/2412.10275
Copy Paste: [[2412.10275]] TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation(https://arxiv.org/abs/2412.10275)
Keywords: diffusion
Abstract: Text-driven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure the consistency between the movement trajectory and the textual description. (ii) how to improve the subjective quality of generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise control and high-quality video generation based on textual-described motion for different objects. Concretely, we enable our TIV-Diffuion model to perceive the textual-described objects and their motion trajectory by incorporating the fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate the problems of object disappearance and misaligned objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning textual features with each object individually. Based on the above innovations, our TIV-Diffusion achieves state-of-the-art high-quality video generation compared with existing TI2V methods.

Title: Coherent 3D Scene Diffusion From a Single RGB Image

Authors: Manuel Dahnert, Angela Dai, Norman Müller, Matthias Nießner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10294
Pdf URL: https://arxiv.org/pdf/2412.10294
Copy Paste: [[2412.10294]] Coherent 3D Scene Diffusion From a Single RGB Image(https://arxiv.org/abs/2412.10294)
Keywords: diffusion, generative
Abstract: We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.

Title: BrushEdit: All-In-One Image Inpainting and Editing

Authors: Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Qiang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10316
Pdf URL: https://arxiv.org/pdf/2412.10316
Copy Paste: [[2412.10316]] BrushEdit: All-In-One Image Inpainting and Editing(https://arxiv.org/abs/2412.10316)
Keywords: diffusion
Abstract: Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.

Title: Generative AI in Medicine

Authors: Divya Shanmugam, Monica Agrawal, Rajiv Movva, Irene Y. Chen, Marzyeh Ghassemi, Emma Pierson
Subjects: cs.LG, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2412.10337
Pdf URL: https://arxiv.org/pdf/2412.10337
Copy Paste: [[2412.10337]] Generative AI in Medicine(https://arxiv.org/abs/2412.10337)
Keywords: generative
Abstract: The increased capabilities of generative AI have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges -- including maintaining privacy and security, improving transparency and interpretability, upholding equity, and rigorously evaluating models -- which must be overcome to realize this potential, and the open research directions they give rise to.

Title: A Universal Degradation-based Bridging Technique for Domain Adaptive Semantic Segmentation

Authors: Wangkai Li, Rui Sun, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10339
Pdf URL: https://arxiv.org/pdf/2412.10339
Copy Paste: [[2412.10339]] A Universal Degradation-based Bridging Technique for Domain Adaptive Semantic Segmentation(https://arxiv.org/abs/2412.10339)
Keywords: diffusion
Abstract: Semantic segmentation often suffers from significant performance degradation when the trained network is applied to a different domain. To address this issue, unsupervised domain adaptation (UDA) has been extensively studied. Existing methods introduce the domain bridging techniques to mitigate substantial domain gap, which construct intermediate domains to facilitate the gradual transfer of knowledge across different domains. However, these strategies often require dataset-specific designs and may generate unnatural intermediate distributions that lead to semantic shift. In this paper, we propose DiDA, a universal degradation-based bridging technique formalized as a diffusion forward process. DiDA consists of two key modules: (1) Degradation-based Intermediate Domain Construction, which creates continuous intermediate domains through simple image degradation operations to encourage learning domain-invariant features as domain differences gradually diminish; (2) Semantic Shift Compensation, which leverages a diffusion encoder to encode and compensate for semantic shift information with degraded time-steps, preserving discriminative representations in the intermediate domains. As a plug-and-play solution, DiDA supports various degradation operations and seamlessly integrates with existing UDA methods. Extensive experiments on prevalent synthetic-to-real semantic segmentation benchmarks demonstrate that DiDA consistently improves performance across different settings and achieves new state-of-the-art results when combined with existing methods.