2024-12-10

Title: FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector

Authors: Jiankang Chen, Ling Deng, Zhiyong Gan, Wei-Shi Zheng, Ruixuan Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05293
Pdf URL: https://arxiv.org/pdf/2412.05293
Copy Paste: [[2412.05293]] FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector(https://arxiv.org/abs/2412.05293)
Keywords: diffusion, foundation model
Abstract: Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core challenge in OOD detection is mitigating the model's overconfidence on OOD data. While recent methods using auxiliary outlier datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited due to costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework FodFoM that innovatively combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based on BLIP-2's image captioning capability, CLIP's vision-language knowledge, and Stable Diffusion's image generation ability. Jointly utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution (ID) images. For the second type, GroundingDINO's object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection methods. Extensive empirical evaluations show that image classifiers with the help of constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks. The code is available at this https URL.

Title: Self-Supervised Learning for Graph-Structured Data in Healthcare Applications: A Comprehensive Review

Authors: Safa Ben Atitallah, Chaima Ben Rabah, Maha Driss, Wadii Boulila, Anis Koubaa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.05312
Pdf URL: https://arxiv.org/pdf/2412.05312
Copy Paste: [[2412.05312]] Self-Supervised Learning for Graph-Structured Data in Healthcare Applications: A Comprehensive Review(https://arxiv.org/abs/2412.05312)
Keywords: self-supervised
Abstract: The abundance of complex and interconnected healthcare data offers numerous opportunities to improve prediction, diagnosis, and treatment. Graph-structured data, which includes entities and their relationships, is well-suited for capturing complex connections. Effectively utilizing this data often requires strong and efficient learning algorithms, especially when dealing with limited labeled data. It is increasingly important for downstream tasks in various domains to utilize self-supervised learning (SSL) as a paradigm for learning and optimizing effective representations from unlabeled data. In this paper, we thoroughly review SSL approaches specifically designed for graph-structured data in healthcare applications. We explore the challenges and opportunities associated with healthcare data and assess the effectiveness of SSL techniques in real-world healthcare applications. Our discussion encompasses various healthcare settings, such as disease prediction, medical image analysis, and drug discovery. We critically evaluate the performance of different SSL methods across these tasks, highlighting their strengths, limitations, and potential future research directions. Ultimately, this review aims to be a valuable resource for both researchers and practitioners looking to utilize SSL for graph-structured data in healthcare, paving the way for improved outcomes and insights in this critical field. To the best of our knowledge, this work represents the first comprehensive review of the literature on SSL applied to graph data in healthcare.

Title: Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

Authors: Zhejun Zhang, Peter Karkus, Maximilian Igl, Wenhao Ding, Yuxiao Chen, Boris Ivanovic, Marco Pavone
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.05334
Pdf URL: https://arxiv.org/pdf/2412.05334
Copy Paste: [[2412.05334]] Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models(https://arxiv.org/abs/2412.05334)
Keywords: generative
Abstract: Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. The code is available at this https URL.
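
Note: the "Closest Among Top-K" idea named in the abstract can be illustrated with a small sketch. This is only one reading of the name, not the authors' implementation: it assumes each token maps to a 2D displacement, that closeness is Euclidean distance to the logged next position, and all names (cat_k_select, token_displacements) are hypothetical.

```python
import math


def cat_k_select(probs, token_displacements, current_pos, gt_next_pos, k):
    """Among the k most likely tokens, return the index of the token whose
    resulting position is closest to the ground-truth next position.

    Assumptions (not from the abstract): tokens are 2D displacements and
    closeness is Euclidean distance.
    """
    # Indices of the k tokens with the highest policy probability.
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    def distance_to_gt(i):
        dx, dy = token_displacements[i]
        return math.hypot(current_pos[0] + dx - gt_next_pos[0],
                          current_pos[1] + dy - gt_next_pos[1])

    # Pick the candidate that keeps the rollout nearest to the logged trajectory.
    return min(top_k, key=distance_to_gt)


# Toy vocabulary of four motion tokens and a policy distribution over them.
displacements = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
probs = [0.1, 0.5, 0.3, 0.1]
chosen_k2 = cat_k_select(probs, displacements, (0.0, 0.0), (0.9, 1.0), k=2)
chosen_k4 = cat_k_select(probs, displacements, (0.0, 0.0), (0.9, 1.0), k=4)
```

With k=2 the choice is restricted to the two most likely tokens, so the rollout stays both plausible under the policy and near the data; a larger k trades likelihood for closeness.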

Title: Generative Model-Based Fusion for Improved Few-Shot Semantic Segmentation of Infrared Images

Authors: Junno Yun, Mehmet Akçakaya
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05341
Pdf URL: https://arxiv.org/pdf/2412.05341
Copy Paste: [[2412.05341]] Generative Model-Based Fusion for Improved Few-Shot Semantic Segmentation of Infrared Images(https://arxiv.org/abs/2412.05341)
Keywords: generative
Abstract: Infrared (IR) imaging is commonly used in various scenarios, including autonomous driving, fire safety and defense applications. Thus, semantic segmentation of such images is of great interest. However, this task faces several challenges, including data scarcity, differing contrast and input channel number compared to natural images, and emergence of classes not represented in databases in certain scenarios, such as defense applications. Few-shot segmentation (FSS) provides a framework to overcome these issues by segmenting query images using a few labeled support samples. However, existing FSS models for IR images require paired visible RGB images, which is a major limitation since acquiring such paired data is difficult or impossible in some applications. In this work, we develop new strategies for FSS of IR images by using generative modeling and fusion techniques. To this end, we propose to synthesize auxiliary data to provide additional channel information to complement the limited contrast in the IR images, as well as IR data synthesis for data augmentation. Here, the former helps the FSS model to better capture the relationship between the support and query sets, while the latter addresses the issue of data scarcity. Finally, to further improve the former aspect, we propose a novel fusion ensemble module for integrating the two different modalities. Our methods are evaluated on different IR datasets, and improve upon the state-of-the-art (SOTA) FSS models.

Title: MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Connor Dunlop, Pinar Yanardag
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05355
Pdf URL: https://arxiv.org/pdf/2412.05355
Copy Paste: [[2412.05355]] MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance(https://arxiv.org/abs/2412.05355)
Keywords: diffusion
Abstract: In this work, we propose the first motion transfer approach in diffusion transformer through Mixture of Score Guidance (MSG), a theoretically-grounded framework for motion transfer in diffusion models. Our key theoretical contribution lies in reformulating conditional score to decompose motion score and content score in diffusion models. By formulating motion transfer as a mixture of potential energies, MSG naturally preserves scene composition and enables creative scene transformations while maintaining the integrity of transferred motion patterns. This novel sampling operates directly on pre-trained video diffusion models without additional training or fine-tuning. Through extensive experiments, MSG demonstrates successful handling of diverse scenarios including single object, multiple objects, and cross-object motion transfer as well as complex camera motion transfer. Additionally, we introduce MotionBench, the first motion transfer dataset consisting of 200 source videos and 1000 transferred motions, covering single/multi-object transfers, and complex camera motions.

Title: Tabular data generation with tensor contraction layers and transformers

Authors: Aníbal Silva, André Restivo, Moisés Santos, Carlos Soares
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.05390
Pdf URL: https://arxiv.org/pdf/2412.05390
Copy Paste: [[2412.05390]] Tabular data generation with tensor contraction layers and transformers(https://arxiv.org/abs/2412.05390)
Keywords: generative
Abstract: Generative modeling for tabular data has recently gained significant attention in the Deep Learning domain. Its objective is to estimate the underlying distribution of the data. However, estimating the underlying distribution of tabular data has its unique challenges. Specifically, this data modality is composed of mixed types of features, making it a non-trivial task for a model to learn intra-relationships between them. One approach to address the mixed feature types is to embed each feature into a continuous matrix via tokenization, while a solution to capture intra-relationships between variables is via the transformer architecture. In this work, we empirically investigate the potential of using embedding representations on tabular data generation, utilizing tensor contraction layers and transformers to model the underlying distribution of tabular data within Variational Autoencoders. Specifically, we compare four architectural approaches: a baseline VAE model, two variants that focus on tensor contraction layers and transformers respectively, and a hybrid model that integrates both techniques. Our empirical study, conducted across multiple datasets from the OpenML CC18 suite, compares models over density estimation and Machine Learning efficiency metrics. The main takeaway from our results is that leveraging embedding representations with the help of tensor contraction layers improves density estimation metrics while maintaining competitive performance in terms of machine learning efficiency.

Title: DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Authors: Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2412.05430
Pdf URL: https://arxiv.org/pdf/2412.05430
Copy Paste: [[2412.05430]] DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA(https://arxiv.org/abs/2412.05430)
Keywords: self-supervised
Abstract: Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non-coding DNA elements critical for regulating gene activity. In this study, we introduce DART-Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero-shot, probed, and fine-tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at this https URL.

Title: COOOL: Challenge Of Out-Of-Label A Novel Benchmark for Autonomous Driving

Authors: Ali K. AlShami, Ananya Kalita, Ryan Rabinowitz, Khang Lam, Rishabh Bezbarua, Terrance Boult, Jugal Kalita
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05462
Pdf URL: https://arxiv.org/pdf/2412.05462
Copy Paste: [[2412.05462]] COOOL: Challenge Of Out-Of-Label A Novel Benchmark for Autonomous Driving(https://arxiv.org/abs/2412.05462)
Keywords: anomaly
Abstract: As the Computer Vision community rapidly develops and advances algorithms for autonomous driving systems, the goal of safer and more efficient autonomous transportation is becoming increasingly achievable. However, it is 2024, and we still do not have fully self-driving cars. One of the remaining core challenges lies in addressing the novelty problem, where self-driving systems still struggle to handle previously unseen situations on the open road. With our Challenge of Out-Of-Label (COOOL) benchmark, we introduce a novel dataset for hazard detection, offering versatile evaluation metrics applicable across various tasks, including novelty-adjacent domains such as Anomaly Detection, Open-Set Recognition, Open Vocabulary, and Domain Adaptation. COOOL comprises over 200 collections of dashcam-oriented videos, annotated by human labelers to identify objects of interest and potential driving hazards. It includes a diverse range of hazards and nuisance objects. Due to the dataset's size and data complexity, COOOL serves exclusively as an evaluation benchmark.

Title: Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data

Authors: Abdulrahman Kerim, Leandro Soriano Marcolino, Erickson R. Nascimento, Richard Jiang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.05466
Pdf URL: https://arxiv.org/pdf/2412.05466
Copy Paste: [[2412.05466]] Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data(https://arxiv.org/abs/2412.05466)
Keywords: diffusion
Abstract: Supervised machine learning methods require large-scale training datasets to perform well in practice. Synthetic data has been showing great progress recently and has been used as a complement to real data. However, there is still a pressing need to assess the usability of synthetically generated data. To this end, we propose a novel UCB-based training procedure combined with a dynamic usability metric. Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets, surpassing existing traditional metrics. Utilizing a UCB-based dynamic approach ensures continual enhancement of model learning. Unlike other approaches, our method effectively adapts to changes in the machine learning model's state and considers the evolving utility of training samples during the training process. We show that our metric is an effective way to rank synthetic images based on their usability. Furthermore, we propose a new attribute-aware bandit pipeline for generating synthetic data by integrating a Large Language Model with Stable Diffusion. Quantitative results show that our approach can boost the performance of a wide range of supervised classifiers. Notably, we observed an improvement of up to 10% in classification accuracy compared to traditional approaches, demonstrating the effectiveness of our approach. Our source code, datasets, and additional materials are publicly available at this https URL.
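
Note: for readers unfamiliar with UCB, the sketch below shows the textbook UCB1 selection rule applied to groups of synthetic samples. It is not the paper's procedure or its usability metric: the arms, the reward values, and the exploration constant c are placeholder assumptions, and ucb1_select is a hypothetical name.

```python
import math


def ucb1_select(counts, mean_rewards, c=2.0):
    """Return the index of the arm to pull next under the standard UCB1 rule.

    counts[i] is how often arm i was used; mean_rewards[i] is its average
    observed reward (here a stand-in for a usability score).
    """
    # Try every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    total = sum(counts)
    # Mean reward plus an exploration bonus that shrinks as an arm is used more.
    scores = [m + math.sqrt(c * math.log(total) / n)
              for m, n in zip(mean_rewards, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# Three hypothetical groups of synthetic images with running usability averages.
next_arm = ucb1_select(counts=[10, 10, 10], mean_rewards=[0.2, 0.8, 0.5])
```

The exploration bonus is what lets the selection adapt as training progresses: a group that looked unhelpful early on is still revisited occasionally.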

Title: Enhancing Sample Generation of Diffusion Models using Noise Level Correction

Authors: Abulikemu Abuduweili, Chenyang Yuan, Changliu Liu, Frank Permenter
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05488
Pdf URL: https://arxiv.org/pdf/2412.05488
Copy Paste: [[2412.05488]] Enhancing Sample Generation of Diffusion Models using Noise Level Correction(https://arxiv.org/abs/2412.05488)
Keywords: diffusion
Abstract: The denoising process of diffusion models can be interpreted as a projection of noisy samples onto the data manifold. Moreover, the noise level in these samples approximates their distance to the underlying manifold. Building on this insight, we propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Additionally, we extend this approach to various image restoration tasks by integrating task-specific constraints, including inpainting, deblurring, super-resolution, colorization, and compressed sensing. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios. Notably, the proposed noise level correction framework is compatible with existing denoising schedulers (e.g., DDIM), offering additional performance improvements.

Title: A New Perspective on Time Series Anomaly Detection: Faster Patch-based Broad Learning System

Authors: Pengyu Li, Zhijie Zhong, Tong Zhang, Zhiwen Yu, C.L. Philip Chen, Kaixiang Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05498
Pdf URL: https://arxiv.org/pdf/2412.05498
Copy Paste: [[2412.05498]] A New Perspective on Time Series Anomaly Detection: Faster Patch-based Broad Learning System(https://arxiv.org/abs/2412.05498)
Keywords: anomaly
Abstract: Time series anomaly detection (TSAD) has been a research hotspot in both academia and industry in recent years. Deep learning methods have become the mainstream research direction due to their excellent performance. However, new viewpoints have emerged in recent TSAD research, which hold that deep learning is not required for TSAD given limitations such as its slow speed. The Broad Learning System (BLS) is a shallow network framework that benefits from its ease of optimization and speed. It has been shown to outperform machine learning approaches while remaining competitive with deep learning. Based on the current situation of TSAD, we propose the Contrastive Patch-based Broad Learning System (CPatchBLS). This is a new exploration of the patching technique and BLS, providing a new perspective for TSAD. We construct Dual-PatchBLS as a base through patching and Simple Kernel Perturbation (SKP) and utilize contrastive learning to capture the differences between normal and abnormal data under different representations. To compensate for the temporal semantic loss caused by various patching, we propose CPatchBLS with model-level integration, which takes advantage of BLS's speed to build model-level integration and improve detection. Using five real-world series anomaly detection datasets, we confirm the method's efficacy: it outperforms previous deep learning and machine learning methods while retaining a high level of computing efficiency.

Title: Street Gaussians without 3D Object Tracker

Authors: Ruida Zhang, Chengxi Li, Chenyangguang Zhang, Xingyu Liu, Haili Yuan, Yanyan Li, Xiangyang Ji, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05548
Pdf URL: https://arxiv.org/pdf/2412.05548
Copy Paste: [[2412.05548]] Street Gaussians without 3D Object Tracker(https://arxiv.org/abs/2412.05548)
Keywords: foundation model
Abstract: Realistic scene reconstruction in driving scenarios poses significant challenges due to fast-moving objects. Most existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space and move them based on these poses during rendering. While some approaches attempt to use 3D object trackers to replace manual annotations, the limited generalization of 3D trackers -- caused by the scarcity of large-scale 3D datasets -- results in inferior reconstructions in real-world settings. In contrast, 2D foundation models demonstrate strong generalization capabilities. To eliminate the reliance on 3D trackers and enhance robustness across diverse environments, we propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy. We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections. Experimental results on Waymo-NOTR datasets show we achieve state-of-the-art performance. Our code will be made publicly available.

Title: Text-to-3D Gaussian Splatting with Physics-Grounded Motion Generation

Authors: Wenqing Wang, Yun Fu
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05560
Pdf URL: https://arxiv.org/pdf/2412.05560
Copy Paste: [[2412.05560]] Text-to-3D Gaussian Splatting with Physics-Grounded Motion Generation(https://arxiv.org/abs/2412.05560)
Keywords: diffusion
Abstract: Text-to-3D generation is a valuable technology in virtual reality and digital content creation. While recent works have pushed the boundaries of text-to-3D generation, producing high-fidelity 3D objects with inefficient prompts and simulating their physics-grounded motion accurately still remain unsolved challenges. To address these challenges, we present an innovative framework that utilizes the Large Language Model (LLM)-refined prompts and diffusion priors-guided Gaussian Splatting (GS) for generating 3D models with accurate appearances and geometric structures. We also incorporate a continuum mechanics-based deformation map and color regularization to synthesize vivid physics-grounded motion for the generated 3D Gaussians, adhering to the conservation of mass and momentum. By integrating text-to-3D generation with physics-grounded motion synthesis, our framework renders photo-realistic 3D objects that exhibit physics-aware motion, accurately reflecting the behaviors of the objects under various forces and constraints across different materials. Extensive experiments demonstrate that our approach achieves high-quality 3D generations with realistic physics-grounded motion.

Title: Dif4FF: Leveraging Multimodal Diffusion Models and Graph Neural Networks for Accurate New Fashion Product Performance Forecasting

Authors: Andrea Avogaro, Luigi Capogrosso, Franco Fummi, Marco Cristani
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05566
Pdf URL: https://arxiv.org/pdf/2412.05566
Copy Paste: [[2412.05566]] Dif4FF: Leveraging Multimodal Diffusion Models and Graph Neural Networks for Accurate New Fashion Product Performance Forecasting(https://arxiv.org/abs/2412.05566)
Keywords: diffusion
Abstract: In the fast-fashion industry, overproduction and unsold inventory create significant environmental problems. Precise sales forecasts for unreleased items could drastically improve the efficiency and profits of industries. However, predicting the success of entirely new styles is difficult due to the absence of past data and ever-changing trends. Specifically, currently used deterministic models struggle with domain shifts when encountering items outside their training data. The recently proposed diffusion models address this issue using a continuous-time diffusion process. Specifically, these models enable us to predict the sales of new items, mitigating the domain shift challenges encountered by deterministic models. As a result, this paper proposes Dif4FF, a novel two-stage pipeline for New Fashion Product Performance Forecasting (NFPPF) that leverages the power of diffusion models conditioned on multimodal data related to specific clothes. Dif4FF first utilizes a multimodal score-based diffusion model to forecast multiple sales trajectories for various garments over time. The forecasts are refined using a powerful Graph Convolutional Network (GCN) architecture. By leveraging the GCN's capability to capture long-range dependencies within both the temporal and spatial data and seeking the optimal solution between these two dimensions, Dif4FF offers the most accurate and efficient forecasting system available in the literature for predicting the sales of new items. We tested Dif4FF on VISUELLE, the de facto standard for NFPPF, achieving new state-of-the-art results.

Title: Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC

Authors: Ming Tao, Bing-Kun Bao, Yaowei Wang, Changsheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05619
Pdf URL: https://arxiv.org/pdf/2412.05619
Copy Paste: [[2412.05619]] Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC(https://arxiv.org/abs/2412.05619)
Keywords: diffusion, generative
Abstract: Large pretrained diffusion models have demonstrated impressive generation capabilities and have been adapted to various downstream tasks. However, unlike Large Language Models (LLMs) that can learn multiple tasks in a single model based on instructed data, diffusion models always require additional branches, task-specific training strategies, and losses for effective adaptation to different downstream tasks. This task-specific fine-tuning approach brings two drawbacks. 1) The task-specific additional networks create gaps between pretraining and fine-tuning which hinders the transfer of pretrained knowledge. 2) It necessitates careful additional network design, raising the barrier to learning and implementation, and making it less user-friendly. Thus, a question arises: Can we achieve a simple, efficient, and general approach to fine-tune diffusion models? To this end, we propose ONE-PIC. It enhances the inherited generative ability in the pretrained diffusion models without introducing additional modules. Specifically, we propose In-Visual-Context Tuning, which constructs task-specific training data by arranging source images and target images into a single image. This approach makes downstream fine-tuning closer to the pretraining, allowing our model to adapt more quickly to various downstream tasks. Moreover, we propose a Masking Strategy to unify different generative tasks. This strategy transforms various downstream fine-tuning tasks into predictions of the masked portions. The extensive experimental results demonstrate that our method is simple and efficient, which streamlines the adaptation process and achieves excellent performance with lower costs. Code is available at this https URL.

Title: Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05628
Pdf URL: https://arxiv.org/pdf/2412.05628
Copy Paste: [[2412.05628]] Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising(https://arxiv.org/abs/2412.05628)
Keywords: diffusion, generative
Abstract: Transformer-based diffusion models have achieved significant advancements across a variety of generative tasks. However, producing high-quality outputs typically necessitates large transformer models, which result in substantial training and inference overhead. In this work, we investigate an alternative approach involving multiple experts for denoising, and introduce Remix-DiT, a novel method designed to enhance output quality at a low cost. The goal of Remix-DiT is to craft N diffusion experts for different denoising timesteps, yet without the need for expensive training of N independent models. To achieve this, Remix-DiT employs K basis models (where K < N) and utilizes learnable mixing coefficients to adaptively craft expert models. This design offers two significant advantages: first, although the total model size is increased, the model produced by the mixing operation shares the same architecture as a plain model, making the overall model as efficient as a standard diffusion transformer. Second, the learnable mixing adaptively allocates model capacity across timesteps, thereby effectively improving generation quality. Experiments conducted on the ImageNet dataset demonstrate that Remix-DiT achieves promising results compared to standard diffusion transformers and other multiple-expert methods. The code is available at this https URL.
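
Note: the mixing step described in the abstract (K basis models combined through learnable coefficients into N experts) can be sketched as a weighted sum of parameters. This is a minimal illustration under stated assumptions, not the authors' code: weights are flat lists, coefficients are normalized with a softmax, and mix_expert and the logits values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_expert(basis_weights, logits):
    """Build one expert's parameters as a convex combination of K basis models.

    basis_weights: list of K flat parameter lists, all the same length.
    logits: K learnable scalars for this expert (softmax is an assumption).
    """
    coeffs = softmax(logits)
    n_params = len(basis_weights[0])
    return [sum(c * w[j] for c, w in zip(coeffs, basis_weights))
            for j in range(n_params)]


# K = 2 basis models with three parameters each, mixed into N = 4 experts,
# one per group of denoising timesteps.
basis = [[1.0, 0.0, 2.0], [0.0, 1.0, 4.0]]
mixing_logits = [[4.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 4.0]]
experts = [mix_expert(basis, row) for row in mixing_logits]
```

Each mixed expert has the same parameter count as a single basis model, which matches the abstract's point that inference cost stays that of a plain model even though K basis models are stored.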

Title: Biological Brain Age Estimation using Sex-Aware Adversarial Variational Autoencoder with Multimodal Neuroimages

Authors: Abd Ur Rehman, Azka Rehman, Muhammad Usman, Abdullah Shahid, Sung-Min Gho, Aleum Lee, Tariq M. Khan, Imran Razzak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05632
Pdf URL: https://arxiv.org/pdf/2412.05632
Copy Paste: [[2412.05632]] Biological Brain Age Estimation using Sex-Aware Adversarial Variational Autoencoder with Multimodal Neuroimages(https://arxiv.org/abs/2412.05632)
Keywords: generative
Abstract: Brain aging involves structural and functional changes and therefore serves as a key biomarker for brain health. Combining structural magnetic resonance imaging (sMRI) and functional magnetic resonance imaging (fMRI) has the potential to improve brain age estimation by leveraging complementary data. However, fMRI data, being noisier than sMRI, complicates multimodal fusion. Traditional fusion methods often introduce more noise than useful information, which can reduce accuracy compared to using sMRI alone. In this paper, we propose a novel multimodal framework for biological brain age estimation, utilizing a sex-aware adversarial variational autoencoder (SA-AVAE). Our framework integrates adversarial and variational learning to effectively disentangle the latent features from both modalities. Specifically, we decompose the latent space into modality-specific codes and shared codes to represent complementary and common information across modalities, respectively. To enhance the disentanglement, we introduce cross-reconstruction and shared-distinct distance ratio loss as regularization terms. Importantly, we incorporate sex information into the learned latent code, enabling the model to capture sex-specific aging patterns for brain age estimation via an integrated regressor module. We evaluate our model using the publicly available OpenBHB dataset, a comprehensive multi-site dataset for brain age estimation. The results from ablation studies and comparisons with state-of-the-art methods demonstrate that our framework outperforms existing approaches and shows significant robustness across various age groups, highlighting its potential for real-time clinical applications in the early detection of neurodegenerative diseases.

Title: Efficient Continuous Video Flow Model for Video Prediction

Authors: Gaurav Shrivastava, Abhinav Shrivastava
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05633
Pdf URL: https://arxiv.org/pdf/2412.05633
Copy Paste: [[2412.05633]] Efficient Continuous Video Flow Model for Video Prediction(https://arxiv.org/abs/2412.05633)
Keywords: diffusion
Abstract: Multi-step prediction models, such as diffusion and rectified flow models, have emerged as state-of-the-art solutions for generation tasks. However, these models exhibit higher latency in sampling new frames compared to single-step methods. This latency issue becomes a significant bottleneck when adapting such methods for video prediction tasks, given that a typical 60-second video comprises approximately 1.5K frames. In this paper, we propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks. Our approach not only reduces the number of sample steps required to predict the next frame but also minimizes computational demands by reducing the model size to one-third of the original size. We evaluate our method on standard video prediction datasets, including KTH, BAIR action robot, Human3.6M and UCF101, demonstrating its efficacy in achieving state-of-the-art performance on these benchmarks.

Title: Hyperedge Anomaly Detection with Hypergraph Neural Network

Authors: Md. Tanvir Alam, Chowdhury Farhan Ahmed, Carson K. Leung
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2412.05641
Pdf URL: https://arxiv.org/pdf/2412.05641
Copy Paste: [[2412.05641]] Hyperedge Anomaly Detection with Hypergraph Neural Network(https://arxiv.org/abs/2412.05641)
Keywords: anomaly
Abstract: A hypergraph is a data structure that enables us to model higher-order associations among data entities. Conventional graph-structured data can represent pairwise relationships only, whereas a hypergraph enables us to associate any number of entities, which is essential in many real-life applications. Hypergraph learning algorithms have been well studied for numerous problem settings, such as node classification and link prediction. However, much less research has been conducted on anomaly detection in hypergraphs. Anomaly detection identifies events that deviate from the usual pattern and can be applied to hypergraphs to detect unusual higher-order associations. In this work, we propose an end-to-end hypergraph neural network-based model for identifying anomalous associations in a hypergraph. Our proposed algorithm operates in an unsupervised manner without requiring any labeled data. Extensive experimentation on several real-life datasets demonstrates the effectiveness of our model in detecting anomalous hyperedges.

Title: WATER-GS: Toward Copyright Protection for 3D Gaussian Splatting via Universal Watermarking

Authors: Yuqi Tan, Xiang Liu, Shuzhao Xie, Bin Chen, Shu-Tao Xia, Zhi Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.05695
Pdf URL: https://arxiv.org/pdf/2412.05695
Copy Paste: [[2412.05695]] WATER-GS: Toward Copyright Protection for 3D Gaussian Splatting via Universal Watermarking(https://arxiv.org/abs/2412.05695)
Keywords: generative
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for 3D scene representation, providing rapid rendering speeds and high fidelity. As 3DGS gains prominence, safeguarding its intellectual property becomes increasingly crucial since 3DGS could be used to imitate unauthorized scene creations and raise copyright issues. Existing watermarking methods for implicit NeRFs cannot be directly applied to 3DGS due to its explicit representation and real-time rendering process, leaving watermarking for 3DGS largely unexplored. In response, we propose WATER-GS, a novel method designed to protect 3DGS copyrights through a universal watermarking strategy. First, we introduce a pre-trained watermark decoder, treating raw 3DGS generative modules as potential watermark encoders to ensure imperceptibility. Additionally, we implement novel 3D distortion layers to enhance the robustness of the embedded watermark against common real-world distortions of point cloud data. Comprehensive experiments and ablation studies demonstrate that WATER-GS effectively embeds imperceptible and robust watermarks into 3DGS without compromising rendering efficiency and quality. Our experiments indicate that the 3D distortion layers can yield up to a 20% improvement in accuracy rate. Notably, our method is adaptable to different 3DGS variants, including 3DGS compression frameworks and 2D Gaussian splatting.

Title: Segment-Level Road Obstacle Detection Using Visual Foundation Model Priors and Likelihood Ratios

Authors: Youssef Shoeb, Nazir Nayal, Azarm Nowzard, Fatma Güney, Hanno Gottschalk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05707
Pdf URL: https://arxiv.org/pdf/2412.05707
Copy Paste: [[2412.05707]] Segment-Level Road Obstacle Detection Using Visual Foundation Model Priors and Likelihood Ratios(https://arxiv.org/abs/2412.05707)
Keywords: foundation model
Abstract: Detecting road obstacles is essential for autonomous vehicles to navigate dynamic and complex traffic environments safely. Current road obstacle detection methods typically assign a score to each pixel and apply a threshold to generate final predictions. However, selecting an appropriate threshold is challenging, and the per-pixel classification approach often leads to fragmented predictions with numerous false positives. In this work, we propose a novel method that leverages segment-level features from visual foundation models and likelihood ratios to predict road obstacles directly. By focusing on segments rather than individual pixels, our approach enhances detection accuracy, reduces false positives, and offers increased robustness to scene variability. We benchmark our approach against existing methods on the RoadObstacle and LostAndFound datasets, achieving state-of-the-art performance without needing a predefined threshold.

Title: On the effective transfer of knowledge from English to Hindi Wikipedia

Authors: Paramita Das, Amartya Roy, Ritabrata Chakraborty, Animesh Mukherjee
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05708
Pdf URL: https://arxiv.org/pdf/2412.05708
Copy Paste: [[2412.05708]] On the effective transfer of knowledge from English to Hindi Wikipedia(https://arxiv.org/abs/2412.05708)
Keywords: in-context
Abstract: Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. If the English Wikipedia page is not up-to-date, our framework extracts relevant information from readily available external resources (such as English books) and adapts it to align with Wikipedia's distinctive style, including its neutral point of view (NPOV) policy, using the in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles by 65% and 62% according to automatic and human judgment-based evaluations, respectively.

Title: PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic Languages with Example Selection from Related Example Banks

Authors: Soumya Suvra Ghosal, Soumyabrata Pal, Koyel Mukherjee, Dinesh Manocha
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.05710
Pdf URL: https://arxiv.org/pdf/2412.05710
Copy Paste: [[2412.05710]] PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic Languages with Example Selection from Related Example Banks(https://arxiv.org/abs/2412.05710)
Keywords: in-context
Abstract: Large Language Models (LLMs) have recently demonstrated impressive few-shot learning capabilities through in-context learning (ICL). However, ICL performance is highly dependent on the choice of few-shot demonstrations, making the selection of optimal examples a persistent research challenge. This issue is further amplified in low-resource Indic languages, where the scarcity of ground-truth data complicates the selection process. In this work, we propose PromptRefine, a novel Alternating Minimization approach for example selection that improves ICL performance on low-resource Indic languages. PromptRefine leverages auxiliary example banks from related high-resource Indic languages and employs multi-task learning techniques to align language-specific retrievers, enabling effective cross-language retrieval. Additionally, we incorporate diversity in the selected examples to enhance generalization and reduce bias. Through comprehensive evaluations on four text generation tasks (Cross-Lingual Question Answering, Multilingual Question Answering, Machine Translation, and Cross-Lingual Summarization) using state-of-the-art LLMs such as LLAMA-3.1-8B, LLAMA-2-7B, Qwen-2-7B, and Qwen-2.5-7B, we demonstrate that PromptRefine significantly outperforms existing frameworks for retrieving examples.

Title: Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent

Authors: Ziyuan Qin, Dongjie Cheng, Haoyu Wang, Huahui Yi, Yuting Shao, Zhiyuan Fan, Kang Li, Qicheng Lao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05722
Pdf URL: https://arxiv.org/pdf/2412.05722
Copy Paste: [[2412.05722]] Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent(https://arxiv.org/abs/2412.05722)
Keywords: diffusion
Abstract: Contemporary Text-to-Image (T2I) models frequently depend on qualitative human evaluations to assess the consistency between synthesized images and the text prompts. There is a demand for quantitative and automatic evaluation tools, given that human evaluation lacks reproducibility. We believe that an effective T2I evaluation metric should accomplish the following: detect instances where the generated images do not align with the textual prompts, a discrepancy we define as the 'hallucination problem' in T2I tasks; record the types and frequency of hallucination issues, aiding users in understanding the causes of errors; and provide comprehensive and intuitive scoring that is close to the human standard. To achieve these objectives, we propose a method based on large language models (LLMs) for conducting question-answering with an extracted scene-graph, and we create a dataset with human-rated scores for generated images. From the methodology perspective, we combine knowledge-enhanced question-answering tasks with image evaluation tasks, making the evaluation metrics more controllable and easier to interpret. For the contribution on the dataset side, we generated 12,000 synthesized images based on 1,000 composited prompts using three advanced T2I models. Subsequently, we conduct human scoring on all synthesized image and prompt pairs to validate the accuracy and effectiveness of our method as an evaluation metric. All generated images and the human-labeled scores will be made publicly available in the future to facilitate ongoing research on this crucial issue. Extensive experiments show that our method aligns more closely with human scoring patterns than other evaluation metrics.

Title: A Tiered GAN Approach for Monet-Style Image Generation

Authors: FNU Neha, Deepshikha Bhati, Deepak Kumar Shukla, Md Amiruzzaman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05724
Pdf URL: https://arxiv.org/pdf/2412.05724
Copy Paste: [[2412.05724]] A Tiered GAN Approach for Monet-Style Image Generation(https://arxiv.org/abs/2412.05724)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have proven to be a powerful tool in generating artistic images, capable of mimicking the styles of renowned painters, such as Claude Monet. This paper introduces a tiered GAN model to progressively refine image quality through a multi-stage process, enhancing the generated images at each step. The model transforms random noise into detailed artistic representations, addressing common challenges such as instability in training, mode collapse, and output quality. This approach combines downsampling and convolutional techniques, enabling the generation of high-quality Monet-style artwork while optimizing computational efficiency. Experimental results demonstrate the architecture's ability to produce foundational artistic structures, though further refinements are necessary for achieving higher levels of realism and fidelity to Monet's style. Future work focuses on improving training methodologies and model complexity to bridge the gap between generated and true artistic images. Additionally, the limitations of traditional GANs in artistic generation are analyzed, and strategies to overcome these shortcomings are proposed.

Title: Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

Authors: Aditya Chinchure, Sahithya Ravi, Raymond Ng, Vered Shwartz, Boyang Li, Leonid Sigal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05725
Pdf URL: https://arxiv.org/pdf/2412.05725
Copy Paste: [[2412.05725]] Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events(https://arxiv.org/abs/2412.05725)
Keywords: generative
Abstract: The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no tasks, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies.

Title: BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Authors: Qinchan (Wing) Li, Kenneth Chen, Changyue (Tina) Su, Qi Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05780
Pdf URL: https://arxiv.org/pdf/2412.05780
Copy Paste: [[2412.05780]] BudgetFusion: Perceptually-Guided Adaptive Diffusion Models(https://arxiv.org/abs/2412.05780)
Keywords: diffusion, generative
Abstract: Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?

Title: Open-Source Acceleration of Stable-Diffusion.cpp

Authors: Jingxu Ng, Cheng Lv, Pu Zhao, Wei Niu, Juyi Lin, Yanzhi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05781
Pdf URL: https://arxiv.org/pdf/2412.05781
Copy Paste: [[2412.05781]] Open-Source Acceleration of Stable-Diffusion.cpp(https://arxiv.org/abs/2412.05781)
Keywords: diffusion
Abstract: Stable diffusion plays a crucial role in generating high-quality images. However, image generation is time-consuming and memory-intensive. To address this, Stable-Diffusion.cpp (Sdcpp) emerges as an efficient inference framework to accelerate the diffusion models. Although it is lightweight, the current implementation of the ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both high inference latency and massive memory usage. To address this, in this work, we present an optimized version of Sdcpp leveraging the Winograd algorithm to accelerate 2D convolution operations, which are the primary bottleneck in the pipeline. By analyzing both dependent and independent computation graphs, we exploit the device's locality and parallelism to achieve substantial performance improvements. Our framework delivers correct end-to-end results across various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and SDXL-Turbo. Our evaluation results demonstrate a speedup of up to 2.76x for individual convolutional layers and an inference speedup of up to 4.79x for the overall image generation process, compared with the original Sdcpp. Homepage: this https URL
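
For readers unfamiliar with the Winograd algorithm mentioned in the abstract, the following is a minimal sketch of its smallest 1D case, F(2, 3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is a textbook illustration only and is not the Sdcpp or ggml implementation; the function names are made up for this example.

```python
# Winograd minimal filtering F(2, 3): two outputs of a 3-tap 1D filter
# using 4 multiplications instead of the 6 needed by the direct form.
# Illustrative only; this is not the Sdcpp / ggml code.

def direct_f23(d, g):
    """Direct sliding-window correlation: 2 outputs, 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs with 4 multiplications on transformed data."""
    # Filter transform: depends only on g, so it can be reused across tiles.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four element-wise products on the transformed input tile.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]   # input tile of 4 samples
g = [0.5, -1.0, 2.0]       # 3-tap filter
print(direct_f23(d, g), winograd_f23(d, g))
```

Nesting the same transform in both dimensions gives the 2D case F(2x2, 3x3), which needs 16 multiplications per tile instead of 36.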

Title: Language-Guided Image Tokenization for Generation

Authors: Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05796
Pdf URL: https://arxiv.org/pdf/2412.05796
Copy Paste: [[2412.05796]] Language-Guided Image Tokenization for Generation(https://arxiv.org/abs/2412.05796)
Keywords: diffusion
Abstract: Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.

Title: Self-Supervised Learning with Probabilistic Density Labeling for Rainfall Probability Estimation

Authors: Junha Lee, Sojung An, Sujeong You, Namik Cho
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.05825
Pdf URL: https://arxiv.org/pdf/2412.05825
Copy Paste: [[2412.05825]] Self-Supervised Learning with Probabilistic Density Labeling for Rainfall Probability Estimation(https://arxiv.org/abs/2412.05825)
Keywords: self-supervised
Abstract: Numerical weather prediction (NWP) models are fundamental in meteorology for simulating and forecasting the behavior of various atmospheric variables. The accuracy of precipitation forecasts and the acquisition of sufficient lead time are crucial for preventing hazardous weather events. However, the performance of NWP models is limited by the nonlinear and unpredictable patterns of extreme weather phenomena driven by temporal dynamics. In this regard, we propose Self-Supervised Learning with Probabilistic Density Labeling (SSLPDL) for estimating rainfall probability by post-processing NWP forecasts. Our post-processing method uses self-supervised learning (SSL) with masked modeling for reconstructing atmospheric physics variables, enabling the model to learn the dependency between variables. The pre-trained encoder is then utilized in transfer learning to a precipitation segmentation task. Furthermore, we introduce a straightforward labeling approach based on probability density to address the class imbalance in extreme weather phenomena like heavy rain events. Experimental results show that SSLPDL surpasses other precipitation forecasting models in regional precipitation post-processing and demonstrates competitive performance in extending forecast lead times. Our code is available at this https URL

Title: Self-Guidance: Boosting Flow and Diffusion Generation on Their Own

Authors: Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05827
Pdf URL: https://arxiv.org/pdf/2412.05827
Copy Paste: [[2412.05827]] Self-Guidance: Boosting Flow and Diffusion Generation on Their Own(https://arxiv.org/abs/2412.05827)
Keywords: diffusion
Abstract: Proper guidance strategies are essential to get optimal generation results without re-training diffusion and flow-based text-to-image models. However, existing guidance methods require either specific training or strong inductive biases of neural network architectures, potentially limiting their applications. To address these issues, in this paper, we introduce Self-Guidance (SG), a strong diffusion guidance that neither needs specific training nor requires certain forms of neural network architectures. Unlike previous approaches, Self-Guidance calculates the guidance vectors by measuring the difference between the velocities of two successive diffusion timesteps. Therefore, SG can be readily applied to both conditional and unconditional models with flexible network architectures. We conduct intensive experiments on both text-to-image and text-to-video generation across flexible architectures, including UNet-based models and diffusion transformer-based models. On current state-of-the-art diffusion models such as Stable Diffusion 3.5 and FLUX, SG significantly boosts the image generation performance in terms of FID and Human Preference Scores. Moreover, we find that SG has a surprisingly positive effect on the generation of high-quality human body parts such as hands, faces, and arms, showing strong potential to overcome traditional challenges in human body generation with minimal effort. We will release our implementation of SG on SD 3.5 and FLUX models along with this paper.

Title: CSG: A Context-Semantic Guided Diffusion Approach in De Novo Musculoskeletal Ultrasound Image Generation

Authors: Elay Dahan, Hedda Cohen Indelman, Angeles M. Perez-Agosto, Carmit Shiran, Gopal Avinash, Doron Shaked, Nati Daniel
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05833
Pdf URL: https://arxiv.org/pdf/2412.05833
Copy Paste: [[2412.05833]] CSG: A Context-Semantic Guided Diffusion Approach in De Novo Musculoskeletal Ultrasound Image Generation(https://arxiv.org/abs/2412.05833)
Keywords: diffusion, generative
Abstract: The use of synthetic images in medical imaging Artificial Intelligence (AI) solutions has been shown to be beneficial in addressing the limited availability of diverse, unbiased, and representative data. Despite the extensive use of synthetic image generation methods, controlling semantic variability and context details remains challenging, limiting their effectiveness in producing diverse and representative medical image datasets. In this work, we introduce a scalable semantic and context-conditioned generative model, coined CSG (Context-Semantic Guidance). This dual conditioning approach allows for comprehensive control over both structure and appearance, advancing the synthesis of realistic and diverse ultrasound images. We demonstrate the ability of CSG to generate findings (pathological anomalies) in musculoskeletal (MSK) ultrasound images. Moreover, we test the quality of the synthetic images using a three-fold validation protocol. The results show that the synthetic images generated by CSG improve the performance of semantic segmentation models, exhibit enhanced similarity to real images compared to the baseline methods, and are indistinguishable from real images according to a Turing test. Furthermore, we demonstrate an extension of CSG that enlarges the variability space of images by synthetically generating augmentations of anatomical geometries and textures.

Title: MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Authors: Shuwei Shi, Biao Gong, Xi Chen, Dandan Zheng, Shuai Tan, Zizheng Yang, Yuyuan Li, Jingwen He, Kecheng Zheng, Jingdong Chen, Ming Yang, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05848
Pdf URL: https://arxiv.org/pdf/2412.05848
Copy Paste: [[2412.05848]] MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation(https://arxiv.org/abs/2412.05848)
Keywords: diffusion
Abstract: Image-to-video (I2V) generation is conditioned on a static image and has recently been enhanced by using motion intensity as an additional control signal. These motion-aware models are appealing for generating diverse motion patterns, yet a reliable motion estimator for training such models on large-scale video sets in the wild is lacking. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, and it is also very hard for human annotators to label abstract motion intensity. Furthermore, the motion intensity should reveal both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage contrastive learning on randomly paired videos and distinguish the video with greater motion intensity. Such a paradigm is friendly for annotation and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages allow the decoupled motion estimator to serve as a general plug-in enhancer for both data processing and video generation training.

Title: 3D-Consistent Image Inpainting with Diffusion Models

Authors: Leonid Antsfeld, Boris Chidlovskii
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05881
Pdf URL: https://arxiv.org/pdf/2412.05881
Copy Paste: [[2412.05881]] 3D-Consistent Image Inpainting with Diffusion Models(https://arxiv.org/abs/2412.05881)
Keywords: diffusion, generative, in-context
Abstract: We address the problem of 3D inconsistency in image inpainting based on diffusion models. We propose a generative model using image pairs that belong to the same scene. To achieve 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that allows the model to recover 3D priors while training to denoise in 2D, without explicit 3D supervision. Training unconditional diffusion models with additional images as in-context guidance makes it possible to harmonize the masked and non-masked regions while repainting and ensures 3D consistency. We evaluate our method on one synthetic and three real-world datasets and show that it generates semantically coherent and 3D-consistent inpaintings and outperforms state-of-the-art methods.

Title: MCP-MedSAM: A Powerful Lightweight Medical Segment Anything Model Trained with a Single GPU in Just One Day

Authors: Donghang Lyu, Ruochen Gao, Marius Staring
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05888
Pdf URL: https://arxiv.org/pdf/2412.05888
Copy Paste: [[2412.05888]] MCP-MedSAM: A Powerful Lightweight Medical Segment Anything Model Trained with a Single GPU in Just One Day(https://arxiv.org/abs/2412.05888)
Keywords: foundation model
Abstract: Medical image segmentation involves partitioning medical images into meaningful regions, with a focus on identifying anatomical structures or abnormalities. It has broad applications in healthcare, and deep learning methods have enabled significant advancements in automating this process. Recently, the introduction of the Segment Anything Model (SAM), the first foundation model for the segmentation task, has prompted researchers to adapt it for the medical domain to improve performance across various tasks. However, SAM's large model size and high GPU requirements hinder its scalability and development in the medical domain. To address these challenges, research has increasingly focused on lightweight adaptations of SAM to reduce its parameter count, enabling training with limited GPU resources while maintaining competitive segmentation performance. In this work, we propose MCP-MedSAM, a powerful and lightweight medical SAM model designed to be trainable on a single GPU within one day while delivering superior segmentation performance. Our method was trained and evaluated on a large-scale challenge dataset (this https URL). Compared to top-ranking methods on the challenge leaderboard, MCP-MedSAM achieved superior performance while requiring only one day of training on a single GPU. The code is publicly available at this https URL.

Title: XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Authors: Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.05896
Pdf URL: https://arxiv.org/pdf/2412.05896
Copy Paste: [[2412.05896]] XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference(https://arxiv.org/abs/2412.05896)
Keywords: generative
Abstract: Recently, generative Large Language Models (LLMs) have achieved remarkable success in numerous applications. Notably, LLM inference generates output tokens one by one, leading to many redundant computations. The widely used KV-Cache framework makes a compromise between time and space complexities. However, caching data creates a steadily growing memory demand that can quickly exhaust the limited memory capacity of modern accelerators such as GPUs, particularly in long-context inference tasks. Existing studies reduce memory consumption by evicting cached data that have less impact on inference accuracy. But the benefit in practice is far from ideal due to the static cache allocation across different LLM network layers. This paper observes that the layer-specific cached data have very different impacts on accuracy. We quantify this difference, and give experimental and theoretical validation. We accordingly make a formal analysis and show that customizing the cache size for each layer in a personalized manner can yield a significant memory reduction, while still providing comparable accuracy. We model the cache allocation as a combinatorial optimization problem and give a global optimal solution. In particular, we devise a mini- and sampling-based inference over a lightweight variant of the LLM model, so as to quickly capture the difference and then feed it into the personalized algorithms. Extensive experiments on real-world datasets demonstrate that our proposals can reduce KV cache memory consumption by 61.6% on average, improve computational efficiency by 2.1x, and increase the throughput by up to 5.5x.
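
The idea of giving each layer its own cache size, rather than a static uniform split, can be illustrated with a simple proportional allocation. This is a stand-in heuristic, not the combinatorial optimization described in the abstract; the importance scores, the budget, and the function name `allocate_cache` are hypothetical.

```python
# Hypothetical sketch: split a total KV-cache token budget across layers in
# proportion to per-layer importance scores (largest-remainder rounding).
# A stand-in heuristic, not the optimization described in the abstract.

def allocate_cache(importance, total_budget, min_per_layer=1):
    """Return one cache size per layer; the sizes sum to total_budget."""
    n = len(importance)
    if total_budget < n * min_per_layer:
        raise ValueError("budget too small for the per-layer minimum")
    spare = total_budget - n * min_per_layer
    total = sum(importance)  # assumed to be positive
    # Ideal (fractional) share of the spare budget for each layer.
    shares = [spare * w / total for w in importance]
    sizes = [min_per_layer + int(s) for s in shares]
    # Hand out the tokens lost to truncation, largest remainder first.
    leftover = total_budget - sum(sizes)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

scores = [4, 3, 2, 1]   # made-up importance scores for 4 layers
sizes = allocate_cache(scores, total_budget=1000, min_per_layer=10)
uniform = [1000 // len(scores)] * len(scores)   # static split for comparison
print(sizes, uniform)
```

Both allocations use the same total budget; the personalized one simply moves capacity toward the layers scored as more important.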

Title: Accelerating Video Diffusion Models via Distribution Matching

Authors: Yuanzhi Zhu, Hanshu Yan, Huan Yang, Kai Zhang, Junnan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05899
Pdf URL: https://arxiv.org/pdf/2412.05899
Copy Paste: [[2412.05899]] Accelerating Video Diffusion Models via Distribution Matching(https://arxiv.org/abs/2412.05899)
Keywords: diffusion, generative
Abstract: Generative models, particularly diffusion models, have achieved significant success in data synthesis across various modalities, including images, videos, and 3D assets. However, current diffusion models are computationally intensive, often requiring numerous sampling steps that limit their practical application, especially in video generation. This work introduces a novel framework for diffusion distillation and distribution matching that dramatically reduces the number of inference steps while maintaining, and potentially improving, generation quality. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator, specifically targeting video generation. By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames with substantially fewer sampling steps. To be specific, the proposed method incorporates a denoising GAN discriminator to distil from the real data and a pre-trained image diffusion model to enhance the frame quality and the prompt-following capabilities. Experimental results using AnimateDiff as the teacher model showcase the method's effectiveness, achieving superior performance in just four sampling steps compared to existing techniques.

Title: GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting and Meshing

Authors: Jianing Zhang, Yuchao Zheng, Ziwei Li, Qionghai Dai, Xiaoyun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05908
Pdf URL: https://arxiv.org/pdf/2412.05908
Copy Paste: [[2412.05908]] GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting and Meshing(https://arxiv.org/abs/2412.05908)
Keywords: diffusion, generative
Abstract: Gaussian splatting has gained attention for its efficient representation and rendering of 3D scenes using continuous Gaussian primitives. However, it struggles with sparse-view inputs due to limited geometric and photometric information, causing ambiguities in depth, shape, and texture. We propose GBR: Generative Bundle Refinement, a method for high-fidelity Gaussian splatting and meshing using only 4-6 input views. GBR integrates a neural bundle adjustment module to enhance geometry accuracy and a generative depth refinement module to improve geometry fidelity. More specifically, the neural bundle adjustment module integrates a foundation network to produce initial 3D point maps and point matches from unposed images, followed by bundle adjustment optimization to improve multiview consistency and point cloud accuracy. The generative depth refinement module employs a diffusion-based strategy to enhance geometric details and fidelity while preserving the scale. Finally, for Gaussian splatting optimization, we propose a multimodal loss function incorporating depth and normal consistency, geometric regularization, and pseudo-view supervision, providing robust guidance under sparse-view conditions. Experiments on widely used datasets show that GBR significantly outperforms existing methods under sparse-view inputs. Additionally, GBR demonstrates the ability to reconstruct and render large-scale real-world scenes, such as the Pavilion of Prince Teng and the Great Wall, with remarkable details using only 6 views.

Title: BiDM: Pushing the Limit of Quantization for Diffusion Models

Authors: Xingyu Zheng, Xianglong Liu, Yichen Bian, Xudong Ma, Yulun Zhang, Jiakai Wang, Jinyang Guo, Haotong Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05926
Pdf URL: https://arxiv.org/pdf/2412.05926
Copy Paste: [[2412.05926]] BiDM: Pushing the Limit of Quantization for Diffusion Models(https://arxiv.org/abs/2412.05926)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have been significantly developed and widely used in various applications due to their excellent generative qualities. However, the expensive computation and massive parameters of DMs hinder their practical use in resource-constrained scenarios. As one of the effective compression approaches, quantization allows DMs to achieve storage saving and inference acceleration by reducing bit-width while maintaining generation performance. However, as the most extreme quantization form, 1-bit binarization causes the generation performance of DMs to face severe degradation or even collapse. This paper proposes a novel method, namely BiDM, for fully binarizing weights and activations of DMs, pushing quantization to the 1-bit limit. From a temporal perspective, we introduce the Timestep-friendly Binary Structure (TBS), which uses learnable activation binarizers and cross-timestep feature connections to address the highly timestep-correlated activation features of DMs. From a spatial perspective, we propose Space Patched Distillation (SPD) to address the difficulty of matching binary features during distillation, focusing on the spatial locality of image generation tasks and noise estimation networks. As the first work to fully binarize DMs, the W1A1 BiDM on the LDM-4 model for LSUN-Bedrooms 256$\times$256 achieves a remarkable FID of 22.74, significantly outperforming the current state-of-the-art general binarization methods with an FID of 59.44 and invalid generative samples, and achieves savings of up to 28.0 times in storage and 52.7 times in OPs. The code is available at this https URL.
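
For readers unfamiliar with 1-bit quantization, the basic operation can be sketched as below. This is a generic sign-plus-scale binarizer (in the style of earlier binary-network work), not BiDM's Timestep-friendly Binary Structure or its learnable binarizers; the function names `binarize` and `dequantize` are hypothetical and chosen only for illustration.

```python
def binarize(weights):
    """Generic 1-bit quantization (an illustrative assumption, not BiDM itself).

    Each weight is replaced by its sign, and one shared scale `alpha`
    (the mean absolute value) is kept so magnitudes are roughly preserved.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    alpha = sum(abs(w) for w in weights) / len(weights)
    # Zero is mapped to +1 so every entry fits in a single bit.
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs


def dequantize(alpha, signs):
    """Reconstruct the approximate full-precision weights."""
    return [alpha * s for s in signs]


alpha, signs = binarize([0.5, -1.5, 1.0])
approx = dequantize(alpha, signs)
```

Storing one sign bit per weight plus a single scale is where the storage saving of binarization comes from; the reconstruction error it introduces is the degradation the abstract refers to.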

Title: Enhanced 3D Generation by 2D Editing

Authors: Haoran Li, Yuli Tian, Yong Liao, Lin Wang, Yuyang Wang, Peng Yuan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05929
Pdf URL: https://arxiv.org/pdf/2412.05929
Copy Paste: [[2412.05929]] Enhanced 3D Generation by 2D Editing(https://arxiv.org/abs/2412.05929)
Keywords: diffusion
Abstract: Distilling 3D representations from pretrained 2D diffusion models is essential for 3D creative applications across gaming, film, and interior design. Current SDS-based methods are hindered by inefficient information distillation from diffusion models, which prevents the creation of photorealistic 3D contents. Our research reevaluates the SDS approach by analyzing its fundamental nature as a basic image editing process that commonly results in over-saturation, over-smoothing and lack of rich content due to the poor-quality single-step denoising. To address these limitations, we propose GE3D (3D Generation by Editing). Each iteration of GE3D utilizes a 2D editing framework that combines a noising trajectory to preserve the information of the input image, alongside a text-guided denoising trajectory. We optimize the process by aligning the latents across both trajectories. This approach fully exploits pretrained diffusion models to distill multi-granularity information through multiple denoising steps, resulting in photorealistic 3D outputs. Both theoretical and experimental results confirm the effectiveness of our approach, which not only advances 3D generation technology but also establishes a novel connection between 3D generation and 2D editing. This could potentially inspire further research in the field. Code and demos are released at this https URL.

Title: Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation

Authors: Yiren Song, Shengtao Lou, Xiaokang Liu, Hai Ci, Pei Yang, Jiaming Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05980
Pdf URL: https://arxiv.org/pdf/2412.05980
Copy Paste: [[2412.05980]] Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation(https://arxiv.org/abs/2412.05980)
Keywords: diffusion, generative
Abstract: Diffusion models have revolutionized generative modeling with their exceptional ability to produce high-fidelity images. However, misuse of such potent tools can lead to the creation of fake news or disturbing content targeting individuals, resulting in significant social harm. In this paper, we introduce Anti-Reference, a novel method that protects images from the threats posed by reference-based generation techniques by adding imperceptible adversarial noise to the images. We propose a unified loss function that enables joint attacks on fine-tuning-based customization methods, non-fine-tuning customization methods, and human-centric driving methods. Based on this loss, we train an Adversarial Noise Encoder to predict the noise or directly optimize the noise using the PGD method. Our method shows certain transfer attack capabilities, effectively challenging both gray-box models and some commercial APIs. Extensive experiments validate the performance of Anti-Reference, establishing a new benchmark in image security.

Title: Nested Diffusion Models Using Hierarchical Latent Priors

Authors: Xiao Zhang, Ruoxi Jiang, Rebecca Willett, Michael Maire
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05984
Pdf URL: https://arxiv.org/pdf/2412.05984
Copy Paste: [[2412.05984]] Nested Diffusion Models Using Hierarchical Latent Priors(https://arxiv.org/abs/2412.05984)
Keywords: diffusion, generative
Abstract: We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.

Title: Enhancing Content Representation for AR Image Quality Assessment Using Knowledge Distillation

Authors: Aymen Sekhri, Seyed Ali Amirshahi, Mohamed-Chaker Larabi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.06003
Pdf URL: https://arxiv.org/pdf/2412.06003
Copy Paste: [[2412.06003]] Enhancing Content Representation for AR Image Quality Assessment Using Knowledge Distillation(https://arxiv.org/abs/2412.06003)
Keywords: self-supervised
Abstract: Augmented Reality (AR) is a major immersive media technology that enriches our perception of reality by overlaying digital content (the foreground) onto physical environments (the background). It has far-reaching applications, from entertainment and gaming to education, healthcare, and industrial training. Nevertheless, challenges such as visual confusion and classical distortions can result in user discomfort when using the technology. Evaluating AR quality of experience becomes essential to measure user satisfaction and engagement, facilitating the refinement necessary for creating immersive and robust experiences. However, the scarcity of data and the distinctive characteristics of AR technology render the development of effective quality assessment metrics challenging. This paper presents a deep learning-based objective metric designed specifically for assessing image quality for AR scenarios. The approach entails four key steps: (1) fine-tuning a self-supervised pre-trained vision transformer to extract prominent features from reference images and distilling this knowledge to improve representations of distorted images, (2) quantifying distortions by computing shift representations, (3) employing cross-attention-based decoders to capture perceptual quality features, and (4) integrating regularization techniques and label smoothing to address the overfitting problem. To validate the proposed approach, we conduct extensive experiments on the ARIQA dataset. The results showcase the superior performance of our proposed approach across all model variants, namely TransformAR, TransformAR-KD, and TransformAR-KD+, in comparison to existing state-of-the-art methods.

Title: Post-hoc Probabilistic Vision-Language Models

Authors: Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06014
Pdf URL: https://arxiv.org/pdf/2412.06014
Copy Paste: [[2412.06014]] Post-hoc Probabilistic Vision-Language Models(https://arxiv.org/abs/2412.06014)
Keywords: generative
Abstract: Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

Title: siForest: Detecting Network Anomalies with Set-Structured Isolation Forest

Authors: Christie Djidjev
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.06015
Pdf URL: https://arxiv.org/pdf/2412.06015
Copy Paste: [[2412.06015]] siForest: Detecting Network Anomalies with Set-Structured Isolation Forest(https://arxiv.org/abs/2412.06015)
Keywords: anomaly
Abstract: As cyber threats continue to evolve in sophistication and scale, the ability to detect anomalous network behavior has become critical for maintaining robust cybersecurity defenses. Modern cybersecurity systems face the overwhelming challenge of analyzing billions of daily network interactions to identify potential threats, making efficient and accurate anomaly detection algorithms crucial for network defense. This paper investigates the use of variations of the Isolation Forest (iForest) machine learning algorithm for detecting anomalies in internet scan data. In particular, it presents the Set-Partitioned Isolation Forest (siForest), a novel extension of the iForest method designed to detect anomalies in set-structured data. By treating instances such as sets of multiple network scans with the same IP address as cohesive units, siForest effectively addresses some challenges of analyzing complex, multidimensional datasets. Extensive experiments on synthetic datasets simulating diverse anomaly scenarios in network traffic demonstrate that siForest has the potential to outperform traditional approaches on some types of internet scan data.

Title: Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Authors: Hyeonho Jeong, Chun-Hao Paul Huang, Jong Chul Ye, Niloy Mitra, Duygu Ceylan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06016
Pdf URL: https://arxiv.org/pdf/2412.06016
Copy Paste: [[2412.06016]] Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation(https://arxiv.org/abs/2412.06016)
Keywords: diffusion
Abstract: While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Project page: this http URL

Title: FlexDiT: Dynamic Token Density Control for Diffusion Transformer

Authors: Shuning Chang, Pichao Wang, Jiasheng Tang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06028
Pdf URL: https://arxiv.org/pdf/2412.06028
Copy Paste: [[2412.06028]] FlexDiT: Dynamic Token Density Control for Diffusion Transformer(https://arxiv.org/abs/2412.06028)
Keywords: diffusion, generative
Abstract: Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands due to both the quadratic complexity of token-based self-attention and the need for extensive sampling steps. While recent research has focused on accelerating sampling, the structural inefficiencies of DiT remain underexplored. We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions to achieve computational efficiency without compromising generation quality. Spatially, FlexDiT employs a three-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, FlexDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between FlexDiT's spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate FlexDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID score on 512$\times$512 ImageNet images, a 56% reduction in FLOPs across video generation datasets including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, and a 69% improvement in inference speed on PixArt-$\alpha$ on text-to-image generation task with a 0.24 FID score decrease. FlexDiT provides a scalable solution for high-quality diffusion-based generation compatible with further sampling optimization techniques.

Title: Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training

Authors: Zhenghong Zhou, Jie An, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06029
Pdf URL: https://arxiv.org/pdf/2412.06029
Copy Paste: [[2412.06029]] Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training(https://arxiv.org/abs/2412.06029)
Keywords: diffusion
Abstract: Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and can disrupt the pre-trained model distribution. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model latent space, ensuring high-quality video generation. Experimental results demonstrate that Latent-Reframe achieves comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets.

Title: Perceptual Hash Inversion Attacks on Image-Based Sexual Abuse Removal Tools

Authors: Sophie Hawkes, Christian Weinert, Teresa Almeida, Maryam Mehrnezhad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.06056
Pdf URL: https://arxiv.org/pdf/2412.06056
Copy Paste: [[2412.06056]] Perceptual Hash Inversion Attacks on Image-Based Sexual Abuse Removal Tools(https://arxiv.org/abs/2412.06056)
Keywords: generative
Abstract: We show that perceptual hashing, crucial for detecting and removing image-based sexual abuse (IBSA) online, faces vulnerabilities from low-budget inversion attacks based on generative AI. This jeopardizes the privacy of users, especially vulnerable groups. We advocate to implement secure hash matching in IBSA removal tools to mitigate potentially fatal consequences.

Title: Are foundation models for computer vision good conformal predictors?

Authors: Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06082
Pdf URL: https://arxiv.org/pdf/2412.06082
Copy Paste: [[2412.06082]] Are foundation models for computer vision good conformal predictors?(https://arxiv.org/abs/2412.06082)
Keywords: foundation model
Abstract: Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has been barely explored. In this work, we delve into the behavior of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. Furthermore, we show that calibrating the confidence predictions of these models leads to efficiency degradation of the conformal set on adaptive CP methods. In contrast, few-shot adaptation to downstream tasks generally enhances conformal scores, where we identify Adapters as a better conformable alternative compared to Prompt Learning strategies. Our empirical study identifies APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage property across multiple challenging, yet realistic scenarios.
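
As background on the marginal coverage guarantee mentioned above, split conformal prediction can be sketched in a few lines. The example below uses the simple threshold score 1 - p(true class) rather than APS, and it is not taken from the paper; the helper names `conformal_threshold` and `prediction_set` and the toy probabilities are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out data (split conformal prediction).

    The nonconformity score is 1 - p(true class). The threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score, which gives
    marginal coverage of at least 1 - alpha under exchangeability.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Return every class whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy calibration data (hypothetical softmax outputs and labels).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 1, 0, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
confident = prediction_set([0.75, 0.25], qhat)
```

Smaller prediction sets at the same coverage level are what the abstract calls efficiency; adaptive methods such as APS change only how the score is computed, not the calibration step.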

Title: GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

Authors: Ashish Goswami, Satyam Kumar Modi, Santhosh Rishi Deshineni, Harman Singh, Prathosh A. P, Parag Singla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06089
Pdf URL: https://arxiv.org/pdf/2412.06089
Copy Paste: [[2412.06089]] GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis(https://arxiv.org/abs/2412.06089)
Keywords: diffusion
Abstract: Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes and the relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps: (a) Generate: we first generate an image using existing diffusion models; (b) Plan: we make use of Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image, expressed in terms of individual objects and their properties, and produce a sequence of corrective steps in the form of an edit-plan; (c) Edit: we make use of existing text-guided image editing models to sequentially execute our edit-plan over the generated image to get the desired image, which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference-time compute for performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models, including DALLE-3 and the latest, SD-3.5-Large. Our approach not only improves the performance of the SOTA models by up to 3 points, it also reduces the performance gap between weaker and stronger models. this https URL

Title: SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Authors: Qiyu Liao, Xin Yuan, Min Xu, Dadong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06138
Pdf URL: https://arxiv.org/pdf/2412.06138
Copy Paste: [[2412.06138]] SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation(https://arxiv.org/abs/2412.06138)
Keywords: diffusion, generative
Abstract: In Fine-Grained Visual Classification (FGVC), distinguishing highly similar subcategories remains a formidable challenge, often necessitating datasets with extensive variability. The acquisition and annotation of such FGVC datasets are notably difficult and costly, demanding specialized knowledge to identify subtle distinctions among closely related categories. Our study introduces a novel approach employing the Sequence Latent Diffusion Model (SLDM) for augmenting FGVC datasets, called Sequence Generative Image Augmentation (SGIA). Our method features a unique Bridging Transfer Learning (BTL) process, designed to minimize the domain gap between real and synthetically augmented data. This approach notably surpasses existing methods in generating more realistic image samples, providing a diverse range of pose transformations that extend beyond the traditional rigid transformations and style changes in generative augmentation. We demonstrate the effectiveness of our augmented dataset with substantial improvements in FGVC tasks on various datasets, models, and training strategies, especially in few-shot learning scenarios. Our method outperforms conventional image augmentation techniques in benchmark tests on three FGVC datasets, showcasing superior realism, variability, and representational quality. Our work sets a new benchmark and outperforms the previous state-of-the-art models in classification accuracy by 0.5% for the CUB-200-2011 dataset and advances the application of generative models in FGVC data augmentation.

Title: Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters

Authors: Yuan Wang, Ouxiang Li, Tingting Mu, Yanbin Hao, Kuien Liu, Xiang Wang, Xiangnan He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06143
Pdf URL: https://arxiv.org/pdf/2412.06143
Copy Paste: [[2412.06143]] Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters(https://arxiv.org/abs/2412.06143)
Keywords: diffusion
Abstract: The success of text-to-image generation enabled by diffusion models has imposed an urgent need to erase unwanted concepts, e.g., copyrighted, offensive, and unsafe ones, from the pre-trained models in a precise, timely, and low-cost manner. The twofold demand of concept erasure requires a precise removal of the target concept during generation (i.e., erasure efficacy), while having a minimal impact on non-target content generation (i.e., prior preservation). Existing methods are either computationally costly or face challenges in maintaining an effective balance between erasure efficacy and prior preservation. To address this, we propose a precise, fast, and low-cost concept erasure method, called Adaptive Value Decomposer (AdaVD), which is training-free. This method is grounded in a classical linear algebraic orthogonal complement operation, implemented in the value space of each cross-attention layer within the UNet of diffusion models. An effective shift factor is designed to adaptively navigate the erasure strength, enhancing prior preservation without sacrificing erasure efficacy. Extensive experimental results show that the proposed AdaVD is effective at both single and multiple concept erasure, showing a 2- to 10-fold improvement in prior preservation as compared to the second best, while achieving the best or near-best erasure efficacy when compared with both training-based and training-free state-of-the-art methods. AdaVD supports a series of diffusion models and downstream image generation tasks. The code is available on the project page: this https URL
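
The "classical linear algebraic orthogonal complement operation" named in the abstract can be illustrated on plain vectors. The sketch below only removes the component of one vector along another; it does not reproduce AdaVD's shift factor or its placement inside cross-attention layers, and the names `dot` and `remove_component` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_component(v, u):
    """Project v onto the orthogonal complement of u.

    Computes v - (<v, u> / <u, u>) * u, so the result has no component
    along u. In a concept-erasure setting, u would stand for the target
    concept's direction (an assumption made here for illustration).
    """
    uu = dot(u, u)
    if uu == 0:
        # A zero direction spans nothing, so there is nothing to remove.
        return list(v)
    scale = dot(v, u) / uu
    return [x - scale * y for x, y in zip(v, u)]


erased = remove_component([3.0, 4.0], [1.0, 0.0])
```

The result is orthogonal to `u` by construction, which is why content unrelated to the erased direction is left untouched.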

Title: ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance

Authors: Yuming Li, Peidong Jia, Daiwei Hong, Yueru Jia, Qi She, Rui Zhao, Ming Lu, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06163
Pdf URL: https://arxiv.org/pdf/2412.06163
Copy Paste: [[2412.06163]] ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance(https://arxiv.org/abs/2412.06163)
Keywords: diffusion
Abstract: Training-free high-resolution (HR) image generation has garnered significant attention due to the high costs of training large diffusion models. Most existing methods begin by reconstructing the overall structure and then proceed to refine the local details. Despite their advancements, they still face issues with repetitive patterns in HR image generation. Besides, HR generation with diffusion models incurs significant computational costs. Thus, parallel generation is essential for interactive applications. To solve the above limitations, we introduce a novel method named ASGDiffusion for parallel HR generation with Asynchronous Structure Guidance (ASG) using pre-trained diffusion models. To solve the pattern repetition problem of HR image generation, ASGDiffusion leverages the low-resolution (LR) noise weighted by the attention mask as the structure guidance for the denoising step to ensure semantic consistency. The proposed structure guidance can significantly alleviate the pattern repetition problem. To enable parallel generation, we further propose a parallelism strategy, which calculates the patch noises and structure guidance asynchronously. By leveraging multi-GPU parallel acceleration, we significantly accelerate generation speed and reduce memory usage per GPU. Extensive experiments demonstrate that our method effectively and efficiently addresses common issues like pattern repetition and achieves state-of-the-art HR generation.

Title: Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

Authors: Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06171
Pdf URL: https://arxiv.org/pdf/2412.06171
Copy Paste: [[2412.06171]] Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity(https://arxiv.org/abs/2412.06171)
Keywords: anomaly
Abstract: How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at this https URL.
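
Note: one simple way to picture a density-aware sampler driven by anomaly scores is inverse-CDF sampling: treat the per-frame scores as a density, accumulate them, and pick frames at evenly spaced quantiles. This is an assumed illustration, not the ATS procedure from the paper. The function name `density_aware_sample` and the `floor` constant are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, num_frames, floor=0.05):
    """Pick frame indices so that high-score regions are sampled more densely.

    `floor` is a hypothetical constant added to every score so that
    low-score regions keep a small, non-zero chance of being selected.
    Duplicate picks are merged, so fewer than `num_frames` indices may
    be returned.
    """
    density = [s + floor for s in scores]
    total = sum(density)
    cdf = [c / total for c in accumulate(density)]
    picks = []
    for k in range(num_frames):
        q = (k + 0.5) / num_frames             # evenly spaced quantile
        idx = min(bisect_left(cdf, q), len(scores) - 1)
        picks.append(idx)
    return sorted(set(picks))


scores = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
selected = density_aware_sample(scores, num_frames=4)
```

In this toy input all quantiles fall inside the two high-score frames, so only those frames are selected.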

Title: Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

Authors: Seungtae Nam, Xiangyu Sun, Gyeongjin Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06234
Pdf URL: https://arxiv.org/pdf/2412.06234
Copy Paste: [[2412.06234]] Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction(https://arxiv.org/abs/2412.06234)
Keywords: generative
Abstract: Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.

Title: VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition

Authors: Michael Yeung, Toya Teramoto, Songtao Wu, Tatsuo Fujiwara, Kenji Suzuki, Tamaki Kojima
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06235
Pdf URL: https://arxiv.org/pdf/2412.06235
Copy Paste: [[2412.06235]] VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition(https://arxiv.org/abs/2412.06235)
Keywords: diffusion
Abstract: The use of large-scale, web-scraped datasets to train face recognition models has raised significant privacy and bias concerns. Synthetic methods mitigate these concerns and provide scalable and controllable face generation to enable fair and accurate face recognition. However, existing synthetic datasets display limited intraclass and interclass diversity and do not match the face recognition performance obtained using real datasets. Here, we propose VariFace, a two-stage diffusion-based pipeline to create fair and diverse synthetic face datasets to train face recognition models. Specifically, we introduce three methods: Face Recognition Consistency to refine demographic labels, Face Vendi Score Guidance to improve interclass diversity, and Divergence Score Conditioning to balance the identity preservation-intraclass diversity trade-off. When constrained to the same dataset size, VariFace considerably outperforms previous synthetic datasets (0.9200 $\rightarrow$ 0.9405) and achieves comparable performance to face recognition models trained with real data (Real Gap = -0.0065). In an unconstrained setting, VariFace not only consistently achieves better performance compared to previous synthetic methods across dataset sizes but also, for the first time, outperforms the real dataset (CASIA-WebFace) across six evaluation datasets. This sets a new state-of-the-art performance with an average face verification accuracy of 0.9567 (Real Gap = +0.0097) across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets and 0.9366 (Real Gap = +0.0380) on the RFW dataset.

Title: U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening

Authors: Sungpyo Kim, Jeonghyeok Do, Jaehyup Lee, Munchurl Kim
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.06243
Pdf URL: https://arxiv.org/pdf/2412.06243
Copy Paste: [[2412.06243]] U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening(https://arxiv.org/abs/2412.06243)
Keywords: diffusion
Abstract: Conventional methods for PAN-sharpening often struggle to restore fine details due to limitations in leveraging high-frequency information. Moreover, diffusion-based approaches lack sufficient conditioning to fully utilize Panchromatic (PAN) images and low-resolution multispectral (LRMS) inputs effectively. To address these challenges, we propose an uncertainty-aware knowledge distillation diffusion framework with details enhancement for PAN-sharpening, called U-Know-DiffPAN. The U-Know-DiffPAN incorporates uncertainty-aware knowledge distillation for effective transfer of feature details from our teacher model to a student one. The teacher model in our U-Know-DiffPAN captures frequency details through frequency selective attention, facilitating accurate reverse process learning. By conditioning the encoder on compact vector representations of PAN and LRMS and the decoder on Wavelet transforms, we enable rich frequency utilization. So, the high-capacity teacher model distills frequency-rich features into a lightweight student model aided by an uncertainty map. From this, the teacher model can guide the student model to focus on difficult image regions for PAN-sharpening via the usage of the uncertainty map. Extensive experiments on diverse datasets demonstrate the robustness and superior performance of our U-Know-DiffPAN over very recent state-of-the-art PAN-sharpening methods.

Title: A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension

Authors: Saahith Janapati, Yangfeng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06245
Pdf URL: https://arxiv.org/pdf/2412.06245
Copy Paste: [[2412.06245]] A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension(https://arxiv.org/abs/2412.06245)
Keywords: in-context
Abstract: The performance of Large Language Models (LLMs) on natural language tasks can be improved through both supervised fine-tuning (SFT) and in-context learning (ICL), which operate via distinct mechanisms. Supervised fine-tuning updates the model's weights by minimizing loss on training data, whereas in-context learning leverages task demonstrations embedded in the prompt, without changing the model's parameters. This study investigates the effects of these learning paradigms on the hidden representations of LLMs using Intrinsic Dimension (ID). We use ID to estimate the number of degrees of freedom between representations extracted from LLMs as they perform specific natural language tasks. We first explore how the ID of LLM representations evolves during SFT and how it varies due to the number of demonstrations in ICL. We then compare the IDs induced by SFT and ICL and find that ICL consistently induces a higher ID compared to SFT, suggesting that representations generated during ICL reside in higher dimensional manifolds in the embedding space.
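
Note: the abstract does not say which intrinsic dimension estimator is used. As an illustration only, the sketch below uses the TwoNN maximum-likelihood estimator, which is an assumption on our part: for each point, take the ratio of the distances to its second and first nearest neighbours, then estimate the dimension as N divided by the sum of the log ratios. The toy point clouds stand in for LLM hidden representations.

```python
import math
import random


def two_nn_id(points):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    For each point, mu = r2 / r1 is the ratio of distances to its second
    and first nearest neighbours; the estimate is N / sum(log(mu)).
    Points with a duplicate (r1 == 0) are skipped.
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0:
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


rng = random.Random(0)
# A 1-D line and a 2-D plane, both embedded in 3-D ambient space.
line = [(t, 2.0 * t, -t) for t in (rng.random() for _ in range(400))]
plane = [(u, v, u + v) for u, v in ((rng.random(), rng.random()) for _ in range(400))]

id_line = two_nn_id(line)
id_plane = two_nn_id(plane)
```

The estimate tracks the dimension of the manifold the points lie on rather than the ambient dimension, which is the sense in which the abstract compares representations produced under SFT and ICL.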

Title: Rendering-Refined Stable Diffusion for Privacy Compliant Synthetic Data

Authors: Kartik Patwari, David Schneider, Xiaoxiao Sun, Chen-Nee Chuah, Lingjuan Lyu, Vivek Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06248
Pdf URL: https://arxiv.org/pdf/2412.06248
Copy Paste: [[2412.06248]] Rendering-Refined Stable Diffusion for Privacy Compliant Synthetic Data(https://arxiv.org/abs/2412.06248)
Keywords: diffusion
Abstract: Growing privacy concerns and regulations like GDPR and CCPA necessitate pseudonymization techniques that protect identity in image datasets. However, retaining utility is also essential. Traditional methods like masking and blurring degrade quality and obscure critical context, especially in human-centric images. We introduce Rendering-Refined Stable Diffusion (RefSD), a pipeline that combines 3D-rendering with Stable Diffusion, enabling prompt-based control over human attributes while preserving posture. Unlike standard diffusion models that fail to retain posture or GANs that lack realism and flexible attribute control, RefSD balances posture preservation, realism, and customization. We also propose HumanGenAI, a framework for human perception and utility evaluation. Human perception assessments reveal attribute-specific strengths and weaknesses of RefSD. Our utility experiments show that models trained on RefSD pseudonymized data outperform those trained on real data in detection tasks, with further performance gains when combining RefSD with real data. For classification tasks, we consistently observe performance improvements when using RefSD data with real data, confirming the utility of our pseudonymized data.

Title: Flow Matching Guide and Code

Authors: Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, Itai Gat
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.06264
Pdf URL: https://arxiv.org/pdf/2412.06264
Copy Paste: [[2412.06264]] Flow Matching Guide and Code(https://arxiv.org/abs/2412.06264)
Keywords: generative
Abstract: Flow Matching (FM) is a recent framework for generative modeling that has achieved state-of-the-art performance across various domains, including image, video, audio, speech, and biological structures. This guide offers a comprehensive and self-contained review of FM, covering its mathematical foundations, design choices, and extensions. By also providing a PyTorch package featuring relevant examples (e.g., image and text generation), this work aims to serve as a resource for both novice and experienced researchers interested in understanding, applying and further developing FM.
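
Note: a common instance of Flow Matching uses a straight-line path between a source sample x0 and a target sample x1, with the model regressing onto the velocity x1 - x0. The sketch below illustrates that one design choice in plain Python and does not use the PyTorch package mentioned in the abstract. All function names are hypothetical, and the "model" is replaced by the known target velocity so that the example stays self-contained.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0 = [0.0, 0.0]
x1 = [2.0, -1.0]
x_half = interpolate(x0, x1, 0.5)
perfect_loss = fm_loss(target_velocity(x0, x1), x0, x1)
end = euler_sample(lambda x, t: target_velocity(x0, x1), x0, steps=10)
```

With the exact velocity field, Euler integration from x0 arrives at x1 up to floating-point error. A trained network would take the place of the lambda.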

Title: Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction

Authors: Dongxu Wei, Zhiqi Li, Peidong Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06273
Pdf URL: https://arxiv.org/pdf/2412.06273
Copy Paste: [[2412.06273]] Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction(https://arxiv.org/abs/2412.06273)
Keywords: diffusion
Abstract: Prior works employing pixel-based Gaussian representation have demonstrated efficacy in feed-forward sparse-view reconstruction. However, such representation necessitates cross-view overlap for accurate depth estimation, and is challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which limits their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction. Furthermore, we extend our method with diffusion models, pioneering feed-forward multi-modal generation of 3D driving scenes.

Title: No Annotations for Object Detection in Art through Stable Diffusion

Authors: Patrick Ramos, Nicolas Gonthier, Selina Khan, Yuta Nakashima, Noa Garcia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06286
Pdf URL: https://arxiv.org/pdf/2412.06286
Copy Paste: [[2412.06286]] No Annotations for Object Detection in Art through Stable Diffusion(https://arxiv.org/abs/2412.06286)
Keywords: diffusion
Abstract: Object detection in art is a valuable tool for the digital humanities, as it allows for faster identification of objects in artistic and historical images compared to humans. However, annotating such images poses significant challenges due to the need for specialized domain expertise. We present NADA (no annotations for detection in art), a pipeline that leverages diffusion models' art-related knowledge for object detection in paintings without the need for full bounding box supervision. Our method, which supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components, consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, outperforming prior work in weakly-supervised detection, while being the first work for zero-shot object detection in art. Code is available at this https URL

Title: See Further When Clear: Curriculum Consistency Model

Authors: Yunpeng Liu, Boxiao Liu, Yi Zhang, Xingzhong Hou, Guanglu Song, Yu Liu, Haihang You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06295
Pdf URL: https://arxiv.org/pdf/2412.06295
Copy Paste: [[2412.06295]] See Further When Clear: Curriculum Consistency Model(https://arxiv.org/abs/2412.06295)
Keywords: diffusion
Abstract: Significant advances have been made in the sampling efficiency of diffusion models and flow matching models, driven by Consistency Distillation (CD), which trains a student model to mimic the output of a teacher model at a later timestep. However, we found that the learning complexity of the student model varies significantly across different timesteps, leading to suboptimal performance. To address this issue, we propose the Curriculum Consistency Model (CCM), which stabilizes and balances the learning complexity across timesteps. Specifically, we regard the distillation process at each timestep as a curriculum and introduce a metric based on Peak Signal-to-Noise Ratio (PSNR) to quantify the learning complexity of this curriculum, then ensure that the curriculum maintains consistent learning complexity across different timesteps by having the teacher model iterate more steps when the noise intensity is low. Our method achieves competitive single-step sampling Fréchet Inception Distance (FID) scores of 1.64 on CIFAR-10 and 2.18 on ImageNet. We have also extended our method to large-scale text-to-image models and confirmed that it generalizes well to both diffusion models (Stable Diffusion XL) and flow matching models (Stable Diffusion 3). The generated samples demonstrate improved image-text alignment and semantic structure, since CCM enlarges the distillation step at large timesteps and reduces the accumulated error.
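
Note: the abstract builds its learning-complexity metric on Peak Signal-to-Noise Ratio. For reference, the standard PSNR definition is 10 * log10(MAX^2 / MSE). The sketch below computes only that standard quantity on flat lists of pixel values. How CCM turns it into a curriculum metric is not reproduced here, and the name `psnr` and the default `max_value` are illustrative choices.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak Signal-to-Noise Ratio in decibels between two equal-length signals.

    `max_value` is the largest possible signal value (1.0 for data scaled
    to [0, 1], 255 for 8-bit images). Identical inputs give infinity.
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [0.0, 0.0, 0.0, 0.0]
noisy = [0.1, 0.1, 0.1, 0.1]
score = psnr(clean, noisy)
```

A higher PSNR means the two signals are closer, so a student target that is nearer to the student's input produces a higher value.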

Title: HAIFAI: Human-AI Collaboration for Mental Face Reconstruction

Authors: Florian Strohm, Mihai Bâce, Andreas Bulling
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06323
Pdf URL: https://arxiv.org/pdf/2412.06323
Copy Paste: [[2412.06323]] HAIFAI: Human-AI Collaboration for Mental Face Reconstruction(https://arxiv.org/abs/2412.06323)
Keywords: generative
Abstract: We present HAIFAI - a novel collaborative human-AI system to tackle the challenging task of reconstructing a visual representation of a face that exists only in a person's mind. Users iteratively rank images presented by the AI system based on their resemblance to a mental image. These rankings, in turn, allow the system to extract relevant image features, fuse them into a unified feature vector, and use a generative model to reconstruct the mental image. We also propose an extension called HAIFAI-X that allows users to manually refine and further improve the reconstruction using an easy-to-use slider interface. To avoid the need for tedious human data collection for model training, we introduce a computational user model of human ranking behaviour. For this, we collected a small face ranking dataset through an online crowd-sourcing study containing data from 275 participants. We evaluate HAIFAI and HAIFAI-X in a 12-participant user study and show that HAIFAI outperforms the previous state of the art regarding reconstruction quality, usability, perceived workload, and reconstruction speed. HAIFAI-X achieves even better reconstruction quality at the cost of reduced usability, perceived workload, and increased reconstruction time. We further validate the reconstructions in a subsequent face ranking study with 18 participants and show that HAIFAI-X achieves a new state-of-the-art identification rate of 60.6%. These findings represent a significant advancement towards developing new collaborative intelligent systems capable of reliably and effortlessly reconstructing a user's mental image.

Title: Normalizing Flows are Capable Generative Models

Authors: Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, Josh Susskind
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06329
Pdf URL: https://arxiv.org/pdf/2412.06329
Copy Paste: [[2412.06329]] Normalizing Flows are Capable Generative Models(https://arxiv.org/abs/2412.06329)
Keywords: diffusion, generative
Abstract: Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at this https URL.

Title: TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions

Authors: Ilya A. Petrov, Riccardo Marin, Julian Chibane, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06334
Pdf URL: https://arxiv.org/pdf/2412.06334
Copy Paste: [[2412.06334]] TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions(https://arxiv.org/abs/2412.06334)
Keywords: diffusion
Abstract: Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide TriDi, the first unified model, which works in any direction. Concretely, we generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities' tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of HOI or a contact map. We embed these two representations into a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, TriDi-generated samples surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics and demonstrate better diversity. We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry. The project page is available at: this https URL.

Title: UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts

Authors: Zhen Wan, Yue Ma, Chenyang Qi, Zhiheng Liu, Tao Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06340
Pdf URL: https://arxiv.org/pdf/2412.06340
Copy Paste: [[2412.06340]] UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts(https://arxiv.org/abs/2412.06340)
Keywords: generative
Abstract: In this paper, we present UniPaint, a unified generative space-time video inpainting framework that enables spatial-temporal inpainting and interpolation. Different from existing methods that treat video inpainting and video interpolation as two distinct tasks, we leverage a unified inpainting framework to tackle them and observe that these two tasks can mutually enhance synthesis performance. Specifically, we first introduce a plug-and-play space-time video inpainting adapter, which can be employed in various personalized models. The key insight is to propose a Mixture of Experts (MoE) attention to cover various tasks. Then, we design a spatial-temporal masking strategy during the training stage to mutually enhance each other and improve performance. UniPaint produces high-quality and aesthetically pleasing results, achieving the best quantitative results across various tasks and scale setups. The code and checkpoints will be available soon.

Title: Is Self-Supervision Enough? Benchmarking Foundation Models Against End-to-End Training for Mitotic Figure Classification

Authors: Jonathan Ganz, Jonas Ammeling, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06365
Pdf URL: https://arxiv.org/pdf/2412.06365
Copy Paste: [[2412.06365]] Is Self-Supervision Enough? Benchmarking Foundation Models Against End-to-End Training for Mitotic Figure Classification(https://arxiv.org/abs/2412.06365)
Keywords: foundation model
Abstract: Foundation models (FMs), i.e., models trained on a vast amount of typically unlabeled data, have become popular and available recently for the domain of histopathology. The key idea is to extract semantically rich vectors from any input patch, allowing for the use of simple subsequent classification networks potentially reducing the required amounts of labeled data, and increasing domain robustness. In this work, we investigate to which degree this also holds for mitotic figure classification. Utilizing two popular public mitotic figure datasets, we compared linear probing of five publicly available FMs against models trained on ImageNet and a simple ResNet50 end-to-end-trained baseline. We found that the end-to-end-trained baseline outperformed all FM-based classifiers, regardless of the amount of data provided. Additionally, we did not observe the FM-based classifiers to be more robust against domain shifts, rendering both of the above assumptions incorrect.

Title: Measuring Pre-training Data Quality without Labels for Time Series Foundation Models

Authors: Songkang Wen, Vasilii Feofanov, Jianfeng Zhang
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.06368
Pdf URL: https://arxiv.org/pdf/2412.06368
Copy Paste: [[2412.06368]] Measuring Pre-training Data Quality without Labels for Time Series Foundation Models(https://arxiv.org/abs/2412.06368)
Keywords: foundation model
Abstract: Recently, there has been a growing interest in time series foundation models that generalize across different downstream tasks. A key to strong foundation models is a diverse pre-training dataset, which is particularly challenging to collect for time series classification. In this work, we explore the performance of a contrastive-learning-based foundation model as a function of the data used for pre-training. We introduce contrastive accuracy, a new measure to evaluate the quality of the representation space learned by the foundation model. Our experiments reveal the positive correlation between the proposed measure and the accuracy of the model on a collection of downstream tasks. This suggests that the contrastive accuracy can serve as a criterion to search for time series datasets that can enhance the pre-training and improve thereby the foundation model's generalization.

Title: Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Authors: Joshua Freeman, Chloe Rippe, Edoardo Debenedetti, Maksym Andriushchenko
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06370
Pdf URL: https://arxiv.org/pdf/2412.06370
Copy Paste: [[2412.06370]] Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit(https://arxiv.org/abs/2412.06370)
Keywords: generative
Abstract: Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 has infringed its copyrights by reproducing articles for use in LLM training and by memorizing the inputs, thereby publicly displaying them in LLM outputs. Our work aims to measure the propensity of OpenAI's LLMs to exhibit verbatim memorization in its outputs relative to other LLMs, specifically focusing on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of the memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be placed on preventing verbatim memorization in very large models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times's copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative AI, law, and policy.

Title: Exploring the Impact of Synthetic Data on Human Gesture Recognition Tasks Using GANs

Authors: George Kontogiannis, Pantelis Tzamalis, Sotiris Nikoletseas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.06389
Pdf URL: https://arxiv.org/pdf/2412.06389
Copy Paste: [[2412.06389]] Exploring the Impact of Synthetic Data on Human Gesture Recognition Tasks Using GANs(https://arxiv.org/abs/2412.06389)
Keywords: generative
Abstract: In the evolving domain of Human Activity Recognition (HAR) using Internet of Things (IoT) devices, there is an emerging interest in employing Deep Generative Models (DGMs) to address data scarcity, enhance data quality, and improve classification metrics scores. Among these types of models, Generative Adversarial Networks (GANs) have arisen as a powerful tool for generating synthetic data that mimic real-world scenarios with high fidelity. However, Human Gesture Recognition (HGR), a subset of HAR, particularly in healthcare applications, using time series data such as allergic gestures, remains largely unexplored. In this paper, we examine and evaluate the performance of two GANs in the generation of synthetic gesture motion data that compose a part of an open-source benchmark dataset. The data is related to the disease identification domain and healthcare, specifically to allergic rhinitis. We also focus on these AI models' performance in terms of fidelity, diversity, and privacy. Furthermore, we examine whether the synthetic data can substitute for real data in training scenarios and how well models trained on synthetic data generalize to the allergic rhinitis gestures. In our work, these gestures are related to 6-axis accelerometer and gyroscope data, serving as multi-variate time series instances, and retrieved from smart wearable devices. To the best of our knowledge, this study is the first to explore the feasibility of synthesizing motion gestures for allergic rhinitis from wearable IoT device data using Generative Adversarial Networks (GANs) and testing their impact on the generalization of gesture recognition systems. It is worth noting that, even though our method has been applied to a specific category of gestures, it is designed to be generalized and can also be deployed on other motion data in the HGR domain.

Title: Generative Lines Matching Models

Authors: Ori Matityahu, Raanan Fattal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06403
Pdf URL: https://arxiv.org/pdf/2412.06403
Copy Paste: [[2412.06403]] Generative Lines Matching Models(https://arxiv.org/abs/2412.06403)
Keywords: diffusion, generative
Abstract: In this paper we identify the source of a singularity in the training loss of key denoising models, that causes the denoiser's predictions to collapse towards the mean of the source or target distributions. This degeneracy creates false basins of attraction, distorting the denoising trajectories and ultimately increasing the number of steps required to sample these models. We circumvent this artifact by leveraging the deterministic ODE-based samplers, offered by certain denoising diffusion and score-matching models, which establish a well-defined change-of-variables between the source and target distributions. Given this correspondence, we propose a new probability flow model, the Lines Matching Model (LMM), which matches globally straight lines interpolating the two distributions. We demonstrate that the flow fields produced by the LMM exhibit notable temporal consistency, resulting in trajectories with excellent straightness scores. Beyond its sampling efficiency, the LMM formulation allows us to enhance the fidelity of the generated samples by integrating domain-specific reconstruction and adversarial losses, and by optimizing its training for the sampling procedure used. Overall, the LMM achieves state-of-the-art FID scores with minimal NFEs on established benchmark datasets: 1.57/1.39 (NFE=1/2) on CIFAR-10, 1.47/1.17 on ImageNet 64x64, and 2.68/1.54 on AFHQ 64x64. Finally, we provide a theoretical analysis showing that the use of optimal transport to relate the two distributions suffers from a curse of dimensionality, where the pairing set size (mini-batch) must scale exponentially with the signal dimension.

Title: Can foundation models actively gather information in interactive environments to test hypotheses?

Authors: Nan Rosemary Ke, Danny P. Sawyer, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.06438
Pdf URL: https://arxiv.org/pdf/2412.06438
Copy Paste: [[2412.06438]] Can foundation models actively gather information in interactive environments to test hypotheses?(https://arxiv.org/abs/2412.06438)
Keywords: foundation model, in-context
Abstract: While problem solving is a standard evaluation task for foundation models, a crucial component of problem solving -- actively and strategically gathering information to test hypotheses -- has not been closely investigated. To assess the information gathering abilities of foundation models in interactive environments, we introduce a framework in which a model must determine the factors influencing a hidden reward function by iteratively reasoning about its previously gathered information and proposing its next exploratory action to maximize information gain at each step. We implement this framework in both a text-based environment, which offers a tightly controlled setting and enables high-throughput parameter sweeps, and in an embodied 3D environment, which requires addressing complexities of multi-modal interaction more relevant to real-world applications. We further investigate whether approaches such as self-correction and increased inference time improve information gathering efficiency. In a relatively simple task that requires identifying a single rewarding feature, we find that LLM's information gathering capability is close to optimal. However, when the model must identify a conjunction of rewarding features, performance is suboptimal. The hit in performance is due partly to the model translating task description to a policy and partly to the model's effectiveness in using its in-context memory. Performance is comparable in both text and 3D embodied environments, although imperfect visual object recognition reduces its accuracy in drawing conclusions from gathered information in the 3D embodied case. For single-feature-based rewards, we find that smaller models curiously perform better; for conjunction-based rewards, incorporating self correction into the model improves performance.

Title: Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Authors: Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06458
Pdf URL: https://arxiv.org/pdf/2412.06458
Copy Paste: [[2412.06458]] Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models(https://arxiv.org/abs/2412.06458)
Keywords: self-supervised
Abstract: Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational cost poses a significant barrier to wider application. To enhance inference efficiency, most existing approaches depend on parameter-dependent or token-dependent strategies to reduce computational demands. However, these methods typically require complex training processes and struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Unlike previous work, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. Trained in a self-supervised manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios. The code for this work will be made publicly available.

Title: Gated Delta Networks: Improving Mamba2 with Delta Rule

Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06464
Pdf URL: https://arxiv.org/pdf/2412.06464
Copy Paste: [[2412.06464]] Gated Delta Networks: Improving Mamba2 with Delta Rule(https://arxiv.org/abs/2412.06464)
Keywords: in-context
Abstract: Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
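
The complementarity of gating and the delta rule described above can be illustrated with a small recurrent memory update. This is a minimal sketch, not the authors' implementation: the abstract does not state the update formula, so the form S <- alpha * S (I - beta k k^T) + beta v k^T, the scalar gates `alpha` and `beta`, and the function names `gated_delta_step` and `read` are all assumptions made for illustration. The parallel training algorithm mentioned in the abstract is not shown.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One assumed gated-delta update of a d_v x d_k memory matrix S.

    alpha in [0, 1] decays the whole memory (gating);
    beta in [0, 1] controls how strongly the value stored under key k
    is replaced by v (delta rule).
    """
    d_k = len(k)
    new_S = []
    for i, row in enumerate(S):
        # What row i of the memory currently returns for key k.
        s_k = sum(row[j] * k[j] for j in range(d_k))
        new_S.append([
            alpha * (row[j] - beta * s_k * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ])
    return new_S


def read(S, q):
    """Retrieve the value associated with query q (matrix-vector product)."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]            # empty memory, d_v = d_k = 2
key = [1.0, 0.0]                         # unit-norm key

S = gated_delta_step(S, key, [3.0, 4.0], alpha=1.0, beta=1.0)
first = read(S, key)                     # value written under `key`

# Targeted update: the same key is overwritten, nothing else is touched.
S = gated_delta_step(S, key, [5.0, 6.0], alpha=1.0, beta=1.0)
second = read(S, key)

# Rapid erasure: alpha = 0 clears everything previously stored.
S = gated_delta_step(S, [0.0, 1.0], [0.0, 0.0], alpha=0.0, beta=1.0)
third = read(S, key)
```

With `beta = 1` and a unit-norm key, the delta term removes the old association for that key before writing the new one, while `alpha` acts on the whole memory at once.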

Title: Small Languages, Big Models: A Study of Continual Training on Languages of Norway

Authors: David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06484
Pdf URL: https://arxiv.org/pdf/2412.06484
Copy Paste: [[2412.06484]] Small Languages, Big Models: A Study of Continual Training on Languages of Norway(https://arxiv.org/abs/2412.06484)
Keywords: generative
Abstract: Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Sámi. To address this issue, we present a novel three-stage continual training approach. We also experiment with combining causal and masked language modeling to get more flexible models. Based on our findings, we train, evaluate, and openly release a new large generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

Title: AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis

Authors: Shidan He, Lei Liu, Shen Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06510
Pdf URL: https://arxiv.org/pdf/2412.06510
Copy Paste: [[2412.06510]] AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis(https://arxiv.org/abs/2412.06510)
Keywords: anomaly
Abstract: Anomaly synthesis is a crucial approach to augment abnormal data for advancing anomaly inspection. Based on the knowledge from the large-scale pre-training, existing text-to-image anomaly synthesis methods predominantly focus on textual information or coarse-aligned visual features to guide the entire generation process. However, these methods often lack sufficient descriptors to capture the complicated characteristics of realistic anomalies (e.g., the fine-grained visual pattern of anomalies), limiting the realism and generalization of the generation process. To this end, we propose a novel anomaly synthesis framework called AnomalyControl to learn cross-modal semantic features as guidance signals, which could encode the generalized anomaly cues from text-image reference prompts and improve the realism of synthesized abnormal samples. Specifically, AnomalyControl adopts a flexible and non-matching prompt pair (i.e., a text-image reference prompt and a targeted text prompt), where a Cross-modal Semantic Modeling (CSM) module is designed to extract cross-modal semantic features from the textual and visual descriptors. Then, an Anomaly-Semantic Enhanced Attention (ASEA) mechanism is formulated to allow CSM to focus on the specific visual patterns of the anomaly, thus enhancing the realism and contextual relevance of the generated anomaly features. Treating cross-modal semantic features as the prior, a Semantic Guided Adapter (SGA) is designed to encode effective guidance signals for the adequate and controllable synthesis process. Extensive experiments indicate that AnomalyControl can achieve state-of-the-art results in anomaly synthesis compared with existing methods while exhibiting superior performance for downstream tasks.

Title: MoViE: Mobile Diffusion for Video Editing

Authors: Adil Karjauv, Noor Fathima, Ioannis Lelekas, Fatih Porikli, Amir Ghodrati, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06578
Pdf URL: https://arxiv.org/pdf/2412.06578
Copy Paste: [[2412.06578]] MoViE: Mobile Diffusion for Video Editing(https://arxiv.org/abs/2412.06578)
Keywords: diffusion
Abstract: Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon the existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. Subsequently, we extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one by introducing a novel adversarial distillation scheme which preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices, while maintaining high quality. Our results are available at this https URL

Title: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

Authors: Tianxin Xie, Yan Rong, Pengfei Zhang, Li Liu
Subjects: cs.CL, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.06602
Pdf URL: https://arxiv.org/pdf/2412.06602
Copy Paste: [[2412.06602]] Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey(https://arxiv.org/abs/2412.06602)
Keywords: diffusion
Abstract: Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. Besides, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industry practitioners.

Title: MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

Authors: Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06614
Pdf URL: https://arxiv.org/pdf/2412.06614
Copy Paste: [[2412.06614]] MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences(https://arxiv.org/abs/2412.06614)
Keywords: diffusion
Abstract: Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL·E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a fairer and more transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences.

Title: MAVias: Mitigate any Visual Bias

Authors: Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06632
Pdf URL: https://arxiv.org/pdf/2412.06632
Copy Paste: [[2412.06632]] MAVias: Mitigate any Visual Bias(https://arxiv.org/abs/2412.06632)
Keywords: foundation model
Abstract: Mitigating biases in computer vision models is an essential step towards the trustworthiness of artificial intelligence models. Existing bias mitigation methods focus on a small set of predefined biases, limiting their applicability in visual datasets where multiple, possibly unknown biases exist. To address this limitation, we introduce MAVias, an open-set bias mitigation approach leveraging foundation models to discover spurious associations between visual attributes and target classes. MAVias first captures a wide variety of visual features in natural language via a foundation image tagging model, and then leverages a large language model to select those visual features defining the target class, resulting in a set of language-coded potential visual biases. We then translate this set of potential biases into vision-language embeddings and introduce an in-processing bias mitigation approach to prevent the model from encoding information related to them. Our experiments on diverse datasets, including CelebA, Waterbirds, ImageNet, and UrbanCars, show that MAVias effectively detects and mitigates a wide range of biases in visual recognition tasks outperforming current state-of-the-art.

Title: Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers

Authors: Johanna Vielhaben, Dilyara Bareeva, Jim Berend, Wojciech Samek, Nils Strodthoff
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06639
Pdf URL: https://arxiv.org/pdf/2412.06639
Copy Paste: [[2412.06639]] Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers(https://arxiv.org/abs/2412.06639)
Keywords: self-supervised
Abstract: Vision transformers (ViTs) can be trained using various learning paradigms, from fully supervised to self-supervised. Diverse training protocols often result in significantly different feature spaces, which are usually compared through alignment analysis. However, current alignment measures quantify this relationship in terms of a single scalar value, obscuring the distinctions between common and unique features in pairs of representations that share the same scalar alignment. We address this limitation by combining alignment analysis with concept discovery, which enables a breakdown of alignment into single concepts encoded in feature space. This fine-grained comparison reveals both universal and unique concepts across different representations, as well as the internal structure of concepts within each of them. Our methodological contributions address two key prerequisites for concept-based alignment: 1) For a description of the representation in terms of concepts that faithfully capture the geometry of the feature space, we define concepts as the most general structure they can possibly form - arbitrary manifolds, allowing hidden features to be described by their proximity to these manifolds. 2) To measure distances between concept proximity scores of two representations, we use a generalized Rand index and partition it for alignment between pairs of concepts. We confirm the superiority of our novel concept definition for alignment analysis over existing linear baselines in a sanity check. The concept-based alignment analysis of representations from four different ViTs reveals that increased supervision correlates with a reduction in the semantic structure of learned representations.

Title: Detecting Facial Image Manipulations with Multi-Layer CNN Models

Authors: Alejandro Marco Montejano, Angela Sanchez Perez, Javier Barrachina, David Ortiz-Perez, Manuel Benavent-Lledo, Jose Garcia-Rodriguez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06643
Pdf URL: https://arxiv.org/pdf/2412.06643
Copy Paste: [[2412.06643]] Detecting Facial Image Manipulations with Multi-Layer CNN Models(https://arxiv.org/abs/2412.06643)
Keywords: diffusion
Abstract: The rapid evolution of digital image manipulation techniques poses significant challenges for content verification, with models such as Stable Diffusion and Midjourney producing highly realistic, yet synthetic, images that can deceive human perception. This research develops and evaluates convolutional neural networks (CNNs) specifically tailored for the detection of these manipulated images. The study implements a comparative analysis of three progressively complex CNN architectures, assessing their ability to classify and localize manipulations across various facial image modifications. Regularization and optimization techniques were systematically incorporated to improve feature extraction and performance. The results indicate that the proposed models achieve an accuracy of up to 76% in distinguishing manipulated images from genuine ones, surpassing traditional approaches. This research not only highlights the potential of CNNs in enhancing the robustness of digital media verification tools, but also provides insights into effective architectural adaptations and training strategies for low-computation environments. Future work will build on these findings by extending the architectures to handle more diverse manipulation techniques and integrating multi-modal data for improved detection capabilities.

Title: Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion

Authors: Shuaiting Li, Juncan Deng, Zeyu Wang, Hong Gu, Kedong Xu, Haibin Shen, Kejie Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06661
Pdf URL: https://arxiv.org/pdf/2412.06661
Copy Paste: [[2412.06661]] Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion(https://arxiv.org/abs/2412.06661)
Keywords: diffusion
Abstract: Text-to-image generation of Stable Diffusion models has achieved notable success due to its remarkable generation ability. However, the repetitive denoising process is computationally intensive during inference, which renders Diffusion models less suitable for real-world applications that require low latency and scalability. Recent studies have employed post-training quantization (PTQ) and quantization-aware training (QAT) methods to compress Diffusion models. Nevertheless, prior research has often neglected to examine the consistency between results generated by quantized models and those from floating-point models. This consistency is crucial in fields such as content creation, design, and edge deployment, as it can significantly enhance both efficiency and system stability for practitioners. To ensure that quantized models generate high-quality and consistent images, we propose an efficient quantization framework for Stable Diffusion models. Our approach features a Serial-to-Parallel calibration pipeline that addresses the consistency of both the calibration and inference processes, as well as ensuring training stability. Based on this pipeline, we further introduce a mix-precision quantization strategy, multi-timestep activation quantization, and time information precalculation techniques to ensure high-fidelity generation in comparison to floating-point models. Through extensive experiments with Stable Diffusion v1-4, v2-1, and XL 1.0, we have demonstrated that our method outperforms the current state-of-the-art techniques when tested on prompts from the COCO validation dataset and the Stable-Diffusion-Prompts dataset. Under W4A8 quantization settings, our approach enhances both distribution similarity and visual similarity by 45%-60%.
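
For readers unfamiliar with the notation, W4A8 denotes 4-bit weights and 8-bit activations. The sketch below shows a generic symmetric uniform quantizer at those two bit widths. It is only an illustration of the setting: the paper's Serial-to-Parallel calibration, mixed-precision strategy, and multi-timestep activation quantization are not reproduced, and the function names and sample values are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(x) for x in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code, then clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


weights = [-1.0, -0.3, 0.0, 0.25, 1.0]   # hypothetical weight values
activations = [0.0, 0.1, 0.7, 2.5]       # hypothetical activation values

w_codes, w_scale = quantize_symmetric(weights, bits=4)      # "W4"
a_codes, a_scale = quantize_symmetric(activations, bits=8)  # "A8"

w_hat = dequantize(w_codes, w_scale)
a_hat = dequantize(a_codes, a_scale)
```

Comparing `w_hat` with `weights` shows the rounding error introduced at 4 bits, which is bounded by half the scale per value; this kind of deviation from the floating-point model is what the abstract's consistency concern refers to.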

Title: Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Authors: Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, Aviral Kumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06685
Pdf URL: https://arxiv.org/pdf/2412.06685
Copy Paste: [[2412.06685]] Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone(https://arxiv.org/abs/2412.06685)
Keywords: diffusion
Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying architectures and sizes. We build off the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied on "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, and run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves the performance and sample-efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
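
The action-optimization step described above (sample several actions, re-rank them with the Q-function, then take gradient steps on an action) can be sketched as follows. This is a simplified illustration under stated assumptions, not the PA-RL implementation: the critic `q_value` is a hypothetical toy quadratic with an analytic gradient, only the top-ranked sample is refined, and the subsequent supervised policy update on the optimized actions is omitted.

```python
import random


def q_value(action, target=(0.5, -0.2)):
    """Hypothetical toy critic: higher is better, maximised at `target`."""
    return -sum((a - t) ** 2 for a, t in zip(action, target))


def q_grad(action, target=(0.5, -0.2)):
    """Analytic gradient of the toy critic with respect to the action."""
    return [-2.0 * (a - t) for a, t in zip(action, target)]


def optimize_action(sample_action, num_samples=8, grad_steps=10, step_size=0.1):
    # Global optimization: draw candidates from the base policy and
    # keep the one the critic ranks highest.
    candidates = [sample_action() for _ in range(num_samples)]
    best = max(candidates, key=q_value)

    # Local optimization: gradient ascent on the critic, starting from `best`.
    action = list(best)
    for _ in range(grad_steps):
        grad = q_grad(action)
        action = [a + step_size * g for a, g in zip(action, grad)]
    return best, action


random.seed(0)


def base_policy():
    """Stand-in for sampling from any policy class (diffusion, transformer, ...)."""
    return [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]


best, refined = optimize_action(base_policy)
```

Because the procedure only needs samples from the policy and gradients of the critic with respect to the action, it makes no assumption about how the policy itself is parameterized, which matches the policy-agnostic framing of the abstract.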

Title: Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Authors: Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06698
Pdf URL: https://arxiv.org/pdf/2412.06698
Copy Paste: [[2412.06698]] Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy(https://arxiv.org/abs/2412.06698)
Keywords: diffusion
Abstract: Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee that the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via our elegantly designed process that synchronizes two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of the 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments in image-based objects and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate the strong generalization ability to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released on this https URL.

Title: You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Authors: Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06699
Pdf URL: https://arxiv.org/pdf/2412.06699
Copy Paste: [[2412.06699]] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale(https://arxiv.org/abs/2412.06699)
Keywords: diffusion
Abstract: Recent 3D generation models typically rely on limited-scale 3D 'gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: this https URL

Title: Facade: High-Precision Insider Threat Detection Using Deep Contextual Anomaly Detection

Authors: Alex Kantchelian, Casper Neo, Ryan Stevens, Hyungwon Kim, Zhaohao Fu, Sadegh Momeni, Birkett Huber, Elie Bursztein, Yanis Pavlidis, Senaka Buthpitiya, Martin Cochran, Massimiliano Poletto
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.06700
Pdf URL: https://arxiv.org/pdf/2412.06700
Copy Paste: [[2412.06700]] Facade: High-Precision Insider Threat Detection Using Deep Contextual Anomaly Detection(https://arxiv.org/abs/2412.06700)
Keywords: anomaly
Abstract: We present Facade (Fast and Accurate Contextual Anomaly DEtection): a high-precision deep-learning-based anomaly detection system deployed at Google (a large technology company) as the last line of defense against insider threats since 2018. Facade is an innovative unsupervised action-context system that detects suspicious actions by considering the context surrounding each action, including relevant facts about the user and other entities involved. It is built around a new multi-modal model that is trained on corporate document access, SQL query, and HTTP/RPC request logs. To overcome the scarcity of incident data, Facade harnesses a novel contrastive learning strategy that relies solely on benign data. Its use of history and implicit social network featurization efficiently handles the frequent out-of-distribution events that occur in a rapidly changing corporate environment, and sustains Facade's high precision performance for a full year after training. Beyond the core model, Facade contributes an innovative clustering approach based on user and action embeddings to improve detection robustness and achieve high precision, multi-scale detection. Functionally what sets Facade apart from existing anomaly detection systems is its high precision. It detects insider attackers with an extremely low false positive rate, lower than 0.01%. For single rogue actions, such as the illegitimate access to a sensitive document, the false positive rate is as low as 0.0003%. To the best of our knowledge, Facade is the only published insider risk anomaly detection system that helps secure such a large corporate environment.

Title: Parkinson's Disease Diagnosis Through Deep Learning: A Novel LSTM-Based Approach for Freezing of Gait Detection

Authors: Aqib Nazir Mir, Iqra Nissar, Mumtaz Ahmed, Sarfaraz Masood, Danish Raza Rizvi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06709
Pdf URL: https://arxiv.org/pdf/2412.06709
Copy Paste: [[2412.06709]] Parkinson's Disease Diagnosis Through Deep Learning: A Novel LSTM-Based Approach for Freezing of Gait Detection(https://arxiv.org/abs/2412.06709)
Keywords: generative
Abstract: Deep learning holds tremendous potential in healthcare for uncovering hidden patterns within extensive clinical datasets, aiding in the diagnosis of various diseases. Parkinson's disease (PD) is a neurodegenerative condition characterized by the deterioration of brain function. In the initial stages of PD, automatic diagnosis poses a challenge due to the similarity in behavior between individuals with PD and those who are healthy. Our objective is to propose an effective model that can aid in the early detection of Parkinson's disease. We employed the VGRF gait signal dataset sourced from Physionet for distinguishing between healthy individuals and those diagnosed with Parkinson's disease. This paper introduces a novel deep learning architecture based on the LSTM network for automatically detecting freezing of gait episodes in Parkinson's disease patients. In contrast to conventional machine learning algorithms, this method eliminates manual feature engineering and proficiently captures prolonged temporal dependencies in gait patterns, thereby improving the diagnosis of Parkinson's disease. The LSTM network resolves the issue of vanishing gradients by employing memory blocks in place of self-connected hidden units, allowing for optimal information assimilation. To prevent overfitting, dropout and L2 regularization techniques have been employed. Additionally, the stochastic gradient-based optimizer Adam is used for the optimization process. The results indicate that our proposed approach surpasses current state-of-the-art models in FOG episode detection, achieving an accuracy of 97.71%, sensitivity of 99%, precision of 98%, and specificity of 96%. This demonstrates its potential as a superior classification method for Parkinson's disease detection.

Title: How to Merge Your Multimodal Models Over Time?

Authors: Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, Ameya Prabhu, Zeynep Akata, Samuel Albanie, Matthias Bethge
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.06712
Pdf URL: https://arxiv.org/pdf/2412.06712
Copy Paste: [[2412.06712]] How to Merge Your Multimodal Models Over Time?(https://arxiv.org/abs/2412.06712)
Keywords: foundation model
Abstract: Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.
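
To make the idea of merging experts as they arrive concrete, here is a minimal sketch of one possible strategy: at every time step, uniformly average all experts seen so far. Uniform weight averaging is used here only as a simple stand-in merging technique; the abstract does not say which techniques TIME favors. The names `average_merge` and `temporal_merge` and the toy weights are assumptions.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def average_merge(models: List[Weights]) -> Weights:
    """Uniformly average parameters of models that share keys and shapes."""
    merged: Weights = {}
    for name in models[0]:
        # Group the i-th entry of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


def temporal_merge(experts_over_time: List[Weights]) -> List[Weights]:
    """Return the deployed model at each step: a merge of all experts so far."""
    deployed = []
    for t in range(1, len(experts_over_time) + 1):
        deployed.append(average_merge(experts_over_time[:t]))
    return deployed


# Toy experts, one arriving per time step (hypothetical values).
experts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
deployed = temporal_merge(experts)
```

This sketch covers only the deployment side. Whether each new expert is trained from the base model or from the previous merged model is a separate choice, which the abstract treats as its own axis.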

Title: Take Fake as Real: Realistic-like Robust Black-box Adversarial Attack to Evade AIGC Detection

Authors: Caiyun Xie, Dengpan Ye, Yunming Zhang, Long Tang, Yunna Lv, Jiacheng Deng, Jiawei Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06727
Pdf URL: https://arxiv.org/pdf/2412.06727
Copy Paste: [[2412.06727]] Take Fake as Real: Realistic-like Robust Black-box Adversarial Attack to Evade AIGC Detection(https://arxiv.org/abs/2412.06727)
Keywords: diffusion
Abstract: The security of AI-generated content (AIGC) detection based on GANs and diffusion models is closely related to the credibility of multimedia content. Malicious adversarial attacks can evade these developing AIGC detectors. However, most existing adversarial attacks focus only on the detection of GAN-generated facial images, struggle to be effective on multi-class natural images and diffusion-based detectors, and exhibit poor invisibility. To fill this gap, we first conduct an in-depth analysis of the vulnerability of AIGC detectors and discover that detectors vary in their vulnerability to different post-processing operations. Then, considering the uncertainty of detectors in real-world scenarios, and based on this discovery, we propose a Realistic-like Robust Black-box Adversarial attack (R$^2$BA) with post-processing fusion optimization. Unlike typical perturbations, R$^2$BA uses real-world post-processing, i.e., Gaussian blur, JPEG compression, Gaussian noise, and light spots, to generate adversarial examples. Specifically, we use a stochastic particle swarm algorithm with inertia decay to optimize post-processing fusion intensity and explore the detector's decision boundary. Guided by the detector's fake probability, R$^2$BA enhances/weakens the detector-vulnerable/detector-robust post-processing intensity to strike a balance between adversariality and invisibility. Extensive experiments on popular/commercial AIGC detectors and datasets demonstrate that R$^2$BA exhibits impressive anti-detection performance, excellent invisibility, and strong robustness in GAN-based and diffusion-based cases. Compared to state-of-the-art white-box and black-box attacks, R$^2$BA shows significant improvements of 15% and 21% in anti-detection performance under the original and robust scenarios, respectively, offering valuable insights for the security of AIGC detection in real-world applications.
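
The optimizer named in the abstract can be illustrated with a generic particle swarm sketch that uses a linearly decaying inertia weight. This is not the authors' implementation: the objective `toy_fake_probability` is a hypothetical quadratic stand-in for a detector's fake probability, the four dimensions loosely stand for four post-processing intensities in [0, 1], and all hyperparameters are assumptions.

```python
import random


def pso_inertia_decay(objective, dim, n_particles=20, n_iters=60,
                      w_start=0.9, w_end=0.4, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over [0, 1]^dim with a basic particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # per-particle best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # swarm-wide best

    for it in range(n_iters):
        # Inertia decays linearly from w_start to w_end: explore early, refine late.
        w = w_start - (w_start - w_end) * it / max(1, n_iters - 1)
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each intensity to the valid range.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_fake_probability(x):
    """Hypothetical stand-in objective with its minimum at `target`."""
    target = [0.2, 0.7, 0.1, 0.5]
    return sum((a - b) ** 2 for a, b in zip(x, target))


best, best_val = pso_inertia_decay(toy_fake_probability, dim=4)
```

In a black-box setting the objective would be a query to the detector rather than a closed-form function, so the number of particles and iterations directly sets the query budget.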

Title: ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet

Authors: Andrei-Robert Alexandrescu, Razvan-Gabriel Petec, Alexandru Manole, Laura-Silvia Diosan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06742
Pdf URL: https://arxiv.org/pdf/2412.06742
Copy Paste: [[2412.06742]] ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet(https://arxiv.org/abs/2412.06742)
Keywords: diffusion
Abstract: Deep Learning has become a ubiquitous paradigm due to its extraordinary effectiveness and applicability in numerous domains. However, the approach suffers from the large amount of data required to realize the potential of this type of model. An ever-growing sub-field of Artificial Intelligence, Image Synthesis, aims to address this limitation through the design of intelligent models capable of creating original and realistic images, an endeavour that could drastically reduce the need for real data. The Stable Diffusion generation paradigm recently propelled state-of-the-art approaches to exceed all previous benchmarks. In this work, we propose the ContRail framework based on the novel Stable Diffusion model ControlNet, which we empower through a multi-modal conditioning method. We experiment with the task of synthetic railway image generation, where we improve the performance in rail-specific tasks, such as rail semantic segmentation, by enriching the dataset with realistic synthetic images.

Title: ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Authors: Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.06745
Pdf URL: https://arxiv.org/pdf/2412.06745
Copy Paste: [[2412.06745]] ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities(https://arxiv.org/abs/2412.06745)
Keywords: foundation model
Abstract: Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogeneous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.

Title: InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention

Authors: Howard Zhang, Yuval Alaluf, Sizhuo Ma, Achuta Kadambi, Jian Wang, Kfir Aberman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06753
Pdf URL: https://arxiv.org/pdf/2412.06753
Copy Paste: [[2412.06753]] InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention(https://arxiv.org/abs/2412.06753)
Keywords: diffusion
Abstract: Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.

Title: Visual Lexicon: Rich Image Features in Language Space

Authors: XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06774
Pdf URL: https://arxiv.org/pdf/2412.06774
Copy Paste: [[2412.06774]] Visual Lexicon: Rich Image Features in Language Space(https://arxiv.org/abs/2412.06774)
Keywords: diffusion, self-supervised
Abstract: We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings--even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.

Title: Diverse Score Distillation

Authors: Yanbo Xu, Jayanth Srinivasa, Gaowen Liu, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06780
Pdf URL: https://arxiv.org/pdf/2412.06780
Copy Paste: [[2412.06780]] Diverse Score Distillation(https://arxiv.org/abs/2412.06780)
Keywords: diffusion
Abstract: Score distillation of 2D diffusion models has proven to be a powerful mechanism to guide 3D optimization, for example enabling text-based 3D generation or single-view reconstruction. A common limitation of existing score distillation formulations, however, is that the outputs of the (mode-seeking) optimization are limited in diversity despite the underlying diffusion model being capable of generating diverse samples. In this work, inspired by the sampling process in denoising diffusion, we propose a score formulation that guides the optimization to follow generation paths defined by random initial seeds, thus ensuring diversity. We then present an approximation to adopt this formulation for scenarios where the optimization may not precisely follow the generation paths (e.g., a 3D representation whose renderings evolve in a co-dependent manner). We showcase the applications of our 'Diverse Score Distillation' (DSD) formulation across tasks such as 2D optimization, text-based 3D inference, and single-view reconstruction. We also empirically validate DSD against prior score distillation formulations and show that it significantly improves sample diversity while preserving fidelity.

Title: Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Authors: Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06781
Pdf URL: https://arxiv.org/pdf/2412.06781
Copy Paste: [[2412.06781]] Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation(https://arxiv.org/abs/2412.06781)
Keywords: diffusion, generative
Abstract: Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Code and models will be made available.

Title: Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation

Authors: Ruihan Gao, Kangle Deng, Gengshan Yang, Wenzhen Yuan, Jun-Yan Zhu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06785
Pdf URL: https://arxiv.org/pdf/2412.06785
Copy Paste: [[2412.06785]] Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation(https://arxiv.org/abs/2412.06785)
Keywords: diffusion
Abstract: 3D generation methods have shown visually compelling results powered by diffusion image priors. However, they often fail to produce realistic geometric details, resulting in overly smooth surfaces or geometric details inaccurately baked in albedo maps. To address this, we introduce a new method that incorporates touch as an additional modality to improve the geometric details of generated 3D assets. We design a lightweight 3D texture field to synthesize visual and tactile textures, guided by 2D diffusion model priors on both visual and tactile domains. We condition the visual texture generation on high-resolution tactile normals and guide the patch-based tactile texture refinement with a customized TextureDreambooth. We further present a multi-part generation pipeline that enables us to synthesize different textures across various regions. To our knowledge, we are the first to leverage high-resolution tactile sensing to enhance geometric details for 3D generation tasks. We evaluate our method in both text-to-3D and image-to-3D settings. Our experiments demonstrate that our method provides customized and realistic fine geometric textures while maintaining accurate alignment between the two modalities of vision and touch.

Title: Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

Authors: M. Hamza Mughal, Rishabh Dabral, Merel C.J. Scholman, Vera Demberg, Christian Theobalt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06786
Pdf URL: https://arxiv.org/pdf/2412.06786
Copy Paste: [[2412.06786]] Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis(https://arxiv.org/abs/2412.06786)
Keywords: diffusion
Abstract: Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we then inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.

Title: [MASK] is All You Need

Authors: Vincent Tao Hu, Björn Ommer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06787
Pdf URL: https://arxiv.org/pdf/2412.06787
Copy Paste: [[2412.06787]] [MASK] is All You Need(https://arxiv.org/abs/2412.06787)
Keywords: diffusion, generative
Abstract: In generative models, two paradigms have gained traction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design space across two types of models including timestep-independence, noise schedule, temperature, guidance strength, etc., in a scalable manner. Second, we re-cast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling by only training once to model the joint distribution. All aforementioned explorations lead to our framework named Discrete Interpolants, which enables us to achieve state-of-the-art or competitive performance compared to previous discrete-state based methods in various benchmarks, like ImageNet256, MS COCO, and video dataset FaceForensics. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks.