2024-12-02

Title: OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains

Authors: Yixuan Zhang, Hui Yang, Chuanchen Luo, Junran Peng, Yuxi Wang, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18660
Pdf URL: https://arxiv.org/pdf/2411.18660
Copy Paste: [[2411.18660]] OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains(https://arxiv.org/abs/2411.18660)
Keywords: diffusion
Abstract: Generating realistic 3D human-object interactions (HOIs) from text descriptions is a active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism which includes semantic adjustment and geometry deformation to improve robustness. Experimental results demonstrate that our OOD-HOI could generate more realistic and physically plausible 3D interaction pose in OOD scenarios compared to existing methods.

Title: HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior

Authors: Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin C.K. Chan, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18662
Pdf URL: https://arxiv.org/pdf/2411.18662
Copy Paste: [[2411.18662]] HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior(https://arxiv.org/abs/2411.18662)
Keywords: diffusion
Abstract: Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion-based Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation-CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real-ISR scenarios through reduced prompt noise and enhanced spatial control.

Title: Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

Authors: Junha Hyung, Kinam Kim, Susung Hong, Min-Jung Kim, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18664
Pdf URL: https://arxiv.org/pdf/2411.18664
Copy Paste: [[2411.18664]] Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling(https://arxiv.org/abs/2411.18664)
Keywords: diffusion
Abstract: Diffusion models have emerged as a powerful tool for generating high-quality images, videos, and 3D content. While sampling guidance techniques like CFG improve quality, they reduce diversity and motion. Autoguidance mitigates these issues but demands extra weak model training, limiting its practicality for large-scale models. In this work, we introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include: (1) introducing STG as an efficient, high-performing guidance technique for video diffusion models, (2) eliminating the need for auxiliary models by simulating a weak model through layer skipping, and (3) ensuring quality-enhanced guidance without compromising sample diversity or dynamics unlike CFG. For additional results, visit this https URL.

Title: SpotLight: Shadow-Guided Object Relighting via Diffusion

Authors: Frédéric Fortier-Chouinard, Zitian Zhang, Louis-Etienne Messier, Mathieu Garon, Anand Bhattad, Jean-François Lalonde
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.18665
Pdf URL: https://arxiv.org/pdf/2411.18665
Copy Paste: [[2411.18665]] SpotLight: Shadow-Guided Object Relighting via Diffusion(https://arxiv.org/abs/2411.18665)
Keywords: diffusion
Abstract: Recent work has shown that diffusion models can be used as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. Unlike typical physics-based renderers, however, neural rendering engines are limited by the lack of manual control over the lighting setup, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise lighting control can be achieved for object relighting simply by specifying the desired shadows of the object. Rather surprisingly, we show that injecting only the shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, leverages existing neural rendering approaches and achieves controllable relighting results with no additional training. Specifically, we demonstrate its use with two neural renderers from the recent literature. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting.

Title: Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting

Authors: Hao Liu, Minglin Chen, Yanni Ma, Haihong Xiao, Ying He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18667
Pdf URL: https://arxiv.org/pdf/2411.18667
Copy Paste: [[2411.18667]] Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting(https://arxiv.org/abs/2411.18667)
Keywords: self-supervised
Abstract: Pre-training on large-scale unlabeled datasets contribute to the model achieving powerful performance on 3D vision tasks, especially when annotations are limited. However, existing rendering-based self-supervised frameworks are computationally demanding and memory-intensive during pre-training due to the inherent nature of volume rendering. In this paper, we propose an efficient framework named GS$^3$ to learn point cloud representation, which seamlessly integrates fast 3D Gaussian Splatting into the rendering-based framework. The core idea behind our framework is to pre-train the point cloud encoder by comparing rendered RGB images with real RGB images, as only Gaussian points enriched with learned rich geometric and appearance information can produce high-quality renderings. Specifically, we back-project the input RGB-D images into 3D space and use a point cloud encoder to extract point-wise features. Then, we predict 3D Gaussian points of the scene from the learned point cloud features and uses a tile-based rasterizer for image rendering. Finally, the pre-trained point cloud encoder can be fine-tuned to adapt to various downstream 3D tasks, including high-level perception tasks such as 3D segmentation and detection, as well as low-level tasks such as 3D scene reconstruction. Extensive experiments on downstream tasks demonstrate the strong transferability of the pre-trained point cloud encoder and the effectiveness of our self-supervised learning framework. In addition, our GS$^3$ framework is highly efficient, achieving approximately 9$\times$ pre-training speedup and less than 0.25$\times$ memory cost compared to the previous rendering-based framework Ponder.

Title: Towards Chunk-Wise Generation for Long Videos

Authors: Siyang Zhang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18668
Pdf URL: https://arxiv.org/pdf/2411.18668
Copy Paste: [[2411.18668]] Towards Chunk-Wise Generation for Long Videos(https://arxiv.org/abs/2411.18668)
Keywords: diffusion, generative
Abstract: Generating long-duration videos has always been a significant challenge due to the inherent complexity of spatio-temporal domain and the substantial GPU memory demands required to calculate huge size tensors. While diffusion based generative models achieve state-of-the-art performance in video generation task, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with specific resolution and length should be specified at first, and the model will perform denoising on the entire video tensor simultaneously, all the frames together. Such approach will easily raise an out-of-memory (OOM) problem when the specified resolution and/or length exceed a certain limit. One of the solutions to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relation and then concatenate them together to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient $k$-step search solution to mitigate these problems.

Title: SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Authors: Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Qifeng Chen, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18669
Pdf URL: https://arxiv.org/pdf/2411.18669
Copy Paste: [[2411.18669]] SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality(https://arxiv.org/abs/2411.18669)
Keywords: foundation model
Abstract: Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework, SimCMF, to study an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities of different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of different basic components from the most naive design and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new imaging modality. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. SimCMF can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. The code is available at this https URL

Title: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18673
Pdf URL: https://arxiv.org/pdf/2411.18673
Copy Paste: [[2411.18673]] AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers(https://arxiv.org/abs/2411.18673)
Keywords: diffusion, generative
Abstract: Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This suggested us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to 4x reduction of training parameters, improved training speed and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.

Title: Active Data Curation Effectively Distills Large-Scale Multimodal Models

Authors: Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, Talfan Evans, Samuel Albanie, Federico Tombari, Yongqin Xian, Alessio Tonioni, Olivier J. Hénaff
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18674
Pdf URL: https://arxiv.org/pdf/2411.18674
Copy Paste: [[2411.18674]] Active Data Curation Effectively Distills Large-Scale Multimodal Models(https://arxiv.org/abs/2411.18674)
Keywords: generative
Abstract: Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy to in fact be complementary to standard KD, and can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with upto 11% less inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.

Title: MatchDiffusion: Training-free Generation of Match-cuts

Authors: Alejandro Pardo, Fabio Pizzati, Tong Zhang, Alexander Pondaven, Philip Torr, Juan Camilo Perez, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18677
Pdf URL: https://arxiv.org/pdf/2411.18677
Copy Paste: [[2411.18677]] MatchDiffusion: Training-free Generation of Match-cuts(https://arxiv.org/abs/2411.18677)
Keywords: diffusion
Abstract: Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting match-cuts is a challenging, resource-intensive process requiring deliberate artistic planning. In MatchDiffusion, we present the first training-free method for match-cut generation using text-to-video diffusion models. MatchDiffusion leverages a key property of diffusion models: early denoising steps define the scene's broad structure, while later steps add details. Guided by this insight, MatchDiffusion employs "Joint Diffusion" to initialize generation for two prompts from shared noise, aligning structure and motion. It then applies "Disjoint Diffusion", allowing the videos to diverge and introduce unique details. This approach produces visually coherent videos suited for match-cuts. User studies and metrics demonstrate MatchDiffusion's effectiveness and potential to democratize match-cut creation.

Title: Random Walks with Tweedie: A Unified Framework for Diffusion Models

Authors: Chicago Y. Park, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.18702
Pdf URL: https://arxiv.org/pdf/2411.18702
Copy Paste: [[2411.18702]] Random Walks with Tweedie: A Unified Framework for Diffusion Models(https://arxiv.org/abs/2411.18702)
Keywords: diffusion, generative
Abstract: We present a simple template for designing generative diffusion model algorithms based on an interpretation of diffusion sampling as a sequence of random walks. Score-based diffusion models are widely used to generate high-quality images. Diffusion models have also been shown to yield state-of-the-art performance in many inverse problems. While these algorithms are often surprisingly simple, the theory behind them is not, and multiple complex theoretical justifications exist in the literature. Here, we provide a simple and largely self-contained theoretical justification for score-based-diffusion models that avoids using the theory of Markov chains or reverse diffusion, instead centering the theory of random walks and Tweedie's formula. This approach leads to unified algorithmic templates for network training and sampling. In particular, these templates cleanly separate training from sampling, e.g., the noise schedule used during training need not match the one used during sampling. We show that several existing diffusion models correspond to particular choices within this template and demonstrate that other, more straightforward algorithmic choices lead to effective diffusion models. The proposed framework has the added benefit of enabling conditional sampling without any likelihood approximation.

Title: Generative Visual Communication in the Era of Vision-Language Models

Authors: Yael Vinker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18727
Pdf URL: https://arxiv.org/pdf/2411.18727
Copy Paste: [[2411.18727]] Generative Visual Communication in the Era of Vision-Language Models(https://arxiv.org/abs/2411.18727)
Keywords: generative
Abstract: Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today's visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models' operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.

Title: Foundation Models in Radiology: What, How, When, Why and Why Not

Authors: Magdalini Paschali, Zhihong Chen, Louis Blankemeier, Maya Varma, Alaa Youssef, Christian Bluethgen, Curtis Langlotz, Sergios Gatidis, Akshay Chaudhari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18730
Pdf URL: https://arxiv.org/pdf/2411.18730
Copy Paste: [[2411.18730]] Foundation Models in Radiology: What, How, When, Why and Why Not(https://arxiv.org/abs/2411.18730)
Keywords: foundation model
Abstract: Recent advances in artificial intelligence have witnessed the emergence of large-scale deep learning models capable of interpreting and generating both textual and imaging data. Such models, typically referred to as foundation models, are trained on extensive corpora of unlabeled data and demonstrate high performance across various tasks. Foundation models have recently received extensive attention from academic, industry, and regulatory bodies. Given the potentially transformative impact that foundation models can have on the field of radiology, this review aims to establish a standardized terminology concerning foundation models, with a specific focus on the requirements of training data, model training paradigms, model capabilities, and evaluation strategies. We further outline potential pathways to facilitate the training of radiology-specific foundation models, with a critical emphasis on elucidating both the benefits and challenges associated with such models. Overall, we envision that this review can unify technical advances and clinical needs in the training of foundation models for radiology in a safe and responsible manner, for ultimately benefiting patients, providers, and radiologists.

Title: DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration

Authors: Zheyan Zhang, Diego Klabjan, Renee CB Manworren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18745
Pdf URL: https://arxiv.org/pdf/2411.18745
Copy Paste: [[2411.18745]] DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration(https://arxiv.org/abs/2411.18745)
Keywords: diffusion
Abstract: In this work, we address a challenge in video inpainting: reconstructing occluded regions in dynamic, real-world scenarios. Motivated by the need for continuous human motion monitoring in healthcare settings, where facial features are frequently obscured, we propose a diffusion-based video-level inpainting model, DiffMVR. Our approach introduces a dynamic dual-guided image prompting system, leveraging adaptive reference frames to guide the inpainting process. This enables the model to capture both fine-grained details and smooth transitions between video frames, offering precise control over inpainting direction and significantly improving restoration accuracy in challenging, dynamic environments. DiffMVR represents a significant advancement in the field of diffusion-based inpainting, with practical implications for real-time applications in various dynamic settings.

Title: Lifting Motion to the 3D World via 2D Diffusion

Authors: Jiaman Li, C. Karen Liu, Jiajun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18808
Pdf URL: https://arxiv.org/pdf/2411.18808
Copy Paste: [[2411.18808]] Lifting Motion to the 3D World via 2D Diffusion(https://arxiv.org/abs/2411.18808)
Keywords: diffusion
Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.

Title: Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18810
Pdf URL: https://arxiv.org/pdf/2411.18810
Copy Paste: [[2411.18810]] Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds(https://arxiv.org/abs/2411.18810)
Keywords: diffusion
Abstract: Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-{\alpha}, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-{\alpha}.

Title: FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

Authors: Junyang Chen, Jinshan Pan, Jiangxin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18824
Pdf URL: https://arxiv.org/pdf/2411.18824
Copy Paste: [[2411.18824]] FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution(https://arxiv.org/abs/2411.18824)
Keywords: diffusion
Abstract: Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs to align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, facilitating the encoder to extract useful features that coincide with diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.

Title: EzSQL: An SQL intermediate representation for improving SQL-to-text Generation

Authors: Meher Bhardwaj, Hrishikesh Ethari, Dennis Singh Moirangthem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18923
Pdf URL: https://arxiv.org/pdf/2411.18923
Copy Paste: [[2411.18923]] EzSQL: An SQL intermediate representation for improving SQL-to-text Generation(https://arxiv.org/abs/2411.18923)
Keywords: generative
Abstract: The SQL-to-text generation task traditionally uses template base, Seq2Seq, tree-to-sequence, and graph-to-sequence models. Recent models take advantage of pre-trained generative language models for this task in the Seq2Seq framework. However, treating SQL as a sequence of inputs to the pre-trained models is not optimal. In this work, we put forward a new SQL intermediate representation called EzSQL to align SQL with the natural language text sequence. EzSQL simplifies the SQL queries and brings them closer to natural language text by modifying operators and keywords, which can usually be described in natural language. EzSQL also removes the need for set operators. Our proposed SQL-to-text generation model uses EzSQL as the input to a pre-trained generative language model for generating the text descriptions. We demonstrate that our model is an effective state-of-the-art method to generate text narrations from SQL queries on the WikiSQL and Spider datasets. We also show that by generating pretraining data using our SQL-to-text generation model, we can enhance the performance of Text-to-SQL parsers.

Title: Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?

Authors: Adrian Tormos, Blanca Llauradó, Fernando Núñez, Axel Romero, Dario Garcia-Gasulla, Javier Béjar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18926
Pdf URL: https://arxiv.org/pdf/2411.18926
Copy Paste: [[2411.18926]] Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?(https://arxiv.org/abs/2411.18926)
Keywords: diffusion, generative
Abstract: The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that do not guarantee to preserve the original tasks. To approximate the distribution of the data using generative models is a way of reducing that problem and also to obtain new samples that resemble the original data. Denoising Diffusion models is a promising Deep Learning technique that can learn good approximations of different kinds of data like images, time series or tabular data. Automatic colonoscopy analysis and specifically Polyp localization in colonoscopy videos is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is a time consuming task and usually only small datasets can be obtained. The fine tuning of application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can generate jointly colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used on various transfer learning experiments in the task of polyp localization with a model based on YOLO v9 on the low data regime.

Title: VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference

Authors: Sakshi Agarwal, Gabe Hoope, Erik B. Sudderth
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18929
Pdf URL: https://arxiv.org/pdf/2411.18929
Copy Paste: [[2411.18929]] VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference(https://arxiv.org/abs/2411.18929)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models learn to remove noise that is artificially added to the data during training. Novel data, like images, may then be generated from Gaussian noise through a sequence of denoising operations. While this Markov process implicitly defines a joint distribution over noise-free data, it is not simple to condition the generative process on masked or partial images. A number of heuristic sampling procedures have been proposed for solving inverse problems with diffusion priors, but these approaches do not directly approximate the true conditional distribution imposed by inference queries, and are often ineffective for large masked regions. Moreover, many of these baselines cannot be applied to latent diffusion models which use image encodings for efficiency. We instead develop a hierarchical variational inference algorithm that analytically marginalizes missing features, and uses a rigorous variational bound to optimize a non-Gaussian Markov approximation of the true diffusion posterior. Through extensive experiments with both pixel-based and latent diffusion models of images, we show that our VIPaint method significantly outperforms previous approaches in both the plausibility and diversity of imputations, and is easily generalized to other inverse problems like deblurring and superresolution.

Title: Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

Authors: Weimin Qiu, Jieke Wang, Meng Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18936
Pdf URL: https://arxiv.org/pdf/2411.18936
Copy Paste: [[2411.18936]] Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects(https://arxiv.org/abs/2411.18936)
Keywords: diffusion
Abstract: Diffusion models have achieved unprecedented fidelity and diversity for synthesizing image, video, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our self-cross guidance is more effective in eliminating subject mixing. What's more, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminant one, e.g., beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion.% for synthesizing similar subjects. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.

Title: ICLERB: In-Context Learning Embedding and Reranker Benchmark

Authors: Marie Al Ghossein, Emile Contal, Alexandre Robicquet
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2411.18947
Pdf URL: https://arxiv.org/pdf/2411.18947
Copy Paste: [[2411.18947]] ICLERB: In-Context Learning Embedding and Reranker Benchmark(https://arxiv.org/abs/2411.18947)
Keywords: in-context
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new tasks by conditioning on prompts with relevant information. Retrieval-Augmented Generation (RAG) enhances ICL by incorporating retrieved documents into the LLM's context at query time. However, traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. In this paper, we propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks. We introduce the In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel evaluation framework that compares retrievers based on their ability to enhance LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune retrieval models using minimal feedback from the LLM. Our experimental results reveal notable differences between ICLERB and existing benchmarks, and demonstrate that small models fine-tuned with our RLRAIF algorithm outperform large state-of-the-art retrieval models. These findings highlight the limitations of existing evaluation methods and the need for specialized benchmarks and training strategies adapted to ICL.

Title: Random Sampling for Diffusion-based Adversarial Purification

Authors: Jiancheng Zhang, Peiran Dong, Yongyong Chen, Yin-Ping Zhao, Song Guo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18956
Pdf URL: https://arxiv.org/pdf/2411.18956
Copy Paste: [[2411.18956]] Random Sampling for Diffusion-based Adversarial Purification(https://arxiv.org/abs/2411.18956)
Keywords: diffusion
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have gained great attention in adversarial purification. Current diffusion-based works focus on designing effective condition-guided mechanisms while ignoring a fundamental problem, i.e., the original DDPM sampling is intended for stable generation, which may not be the optimal solution for adversarial purification. Inspired by the stability of the Denoising Diffusion Implicit Model (DDIM), we propose an opposite sampling scheme called random sampling. In brief, random sampling will sample from a random noisy space during each diffusion process, while DDPM and DDIM sampling will continuously sample from the adjacent or original noisy space. Thus, random sampling obtains more randomness and achieves stronger robustness against adversarial attacks. Correspondingly, we also introduce a novel mediator conditional guidance to guarantee the consistency of the prediction under the purified image and clean image input. To expand awareness of guided diffusion purification, we conduct a detailed evaluation with different sampling methods and our random sampling achieves an impressive improvement in multiple settings. Leveraging mediator-guided random sampling, we also establish a baseline method named DiffAP, which significantly outperforms state-of-the-art (SOTA) approaches in performance and defensive stability. Remarkably, under strong attack, our DiffAP even achieves a more than 20% robustness advantage with 10$\times$ sampling acceleration.

Title: Perception of Visual Content: Differences Between Humans and Foundation Models

Authors: Nardiena A. Pratama, Shaoyang Fan, Gianluca Demartini
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18968
Pdf URL: https://arxiv.org/pdf/2411.18968
Copy Paste: [[2411.18968]] Perception of Visual Content: Differences Between Humans and Foundation Models(https://arxiv.org/abs/2411.18968)
Keywords: foundation model
Abstract: Human-annotated content is often used to train machine learning (ML) models. However, recently, language and multi-modal foundational models have been used to replace and scale-up human annotator's efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels washing their hands. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show low similarity between human and machine annotations from a low-level perspective, i.e., types of words that appear and sentence structures, but are alike in how similar or dissimilar they perceive images across different regions. Additionally, human annotations resulted in best overall and most balanced region classification performance on the class level, while ML Objects and ML Captions performed best for income regression. Humans and machines' similarity in their lack of bias when perceiving images highlights how they are more alike than what was initially perceived. The superior and fairer performance of using human annotations for region classification and machine annotations for income regression show how important the quality of the images and the discriminative features in the annotations are.

Title: SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing

Authors: Rong-Cheng Tu, Wenhao Sun, Zhao Jin, Jingyi Liao, Jiaxing Huang, Dacheng Tao
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2411.18983
Pdf URL: https://arxiv.org/pdf/2411.18983
Copy Paste: [[2411.18983]] SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing(https://arxiv.org/abs/2411.18983)
Keywords: generative
Abstract: While open-source video generation and editing models have made significant progress, individual models are typically limited to specific tasks, failing to meet the diverse needs of users. Effectively coordinating these models can unlock a wide range of video generation and editing capabilities. However, manual coordination is complex and time-consuming, requiring users to deeply understand task requirements and possess comprehensive knowledge of each model's performance, applicability, and limitations, thereby increasing the barrier to entry. To address these challenges, we propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models, enhancing the adaptability, efficiency, and overall quality of video generation and editing. Specifically, the SPAgent assembles a tool library integrating state-of-the-art open-source image and video generation and editing models as tools. After fine-tuning on our manually annotated dataset, SPAgent can automatically coordinate the tools for video generation and editing, through our novelly designed three-step framework: (1) decoupled intent recognition, (2) principle-guided route planning, and (3) capability-based execution model selection. Additionally, we enhance the SPAgent's video quality evaluation capability, enabling it to autonomously assess and incorporate new video generation and editing models into its tool library without human intervention. Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos, highlighting its versatility and adaptability across various video tasks.

Title: Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement

Authors: Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis, yman Elgharabawy, Ahmad Liaqat, Usman Ali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19005
Pdf URL: https://arxiv.org/pdf/2411.19005
Copy Paste: [[2411.19005]] Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement(https://arxiv.org/abs/2411.19005)
Keywords: generative
Abstract: This paper presents a novel deep-learning framework that significantly enhances the transformation of rudimentary face sketches into high-fidelity colour images. Employing a Convolutional Block Attention-based Auto-encoder Network (CA2N), our approach effectively captures and enhances critical facial features through a block attention mechanism within an encoder-decoder architecture. Subsequently, the framework utilises a noise-induced conditional Generative Adversarial Network (cGAN) process that allows the system to maintain high performance even on domains unseen during the training. These enhancements lead to considerable improvements in image realism and fidelity, with our model achieving superior performance metrics that outperform the best method by FID margin of 17, 23, and 38 on CelebAMask-HQ, CUHK, and CUFSF datasets; respectively. The model sets a new state-of-the-art in sketch-to-image generation, can generalize across sketch types, and offers a robust solution for applications such as criminal identification in law enforcement.

Title: PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors

Authors: Guangshun Wei, Yuan Feng, Long Ma, Chen Wang, Yuanfeng Zhou, Changjian Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19036
Pdf URL: https://arxiv.org/pdf/2411.19036
Copy Paste: [[2411.19036]] PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors(https://arxiv.org/abs/2411.19036)
Keywords: diffusion
Abstract: This paper presents PCDreamer, a novel method for point cloud completion. Traditional methods typically extract features from partial point clouds to predict missing regions, but the large solution space often leads to unsatisfactory results. More recent approaches have started to use images as extra guidance, effectively improving performance, but obtaining paired data of images and partial point clouds is challenging in practice. To overcome these limitations, we harness the relatively view-consistent multi-view diffusion priors within large models, to generate novel views of the desired shape. The resulting image set encodes both global and local shape cues, which is especially beneficial for shape completion. To fully exploit the priors, we have designed a shape fusion module for producing an initial complete shape from multi-modality input (\ie, images and point clouds), and a follow-up shape consolidation module to obtain the final complete shape by discarding unreliable points introduced by the inconsistency from diffusion priors. Extensive experimental results demonstrate our superior performance, especially in recovering fine details.

Title: 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes

Authors: Tejaswini Medi, Arianna Rampini, Pradyumna Reddy, Pradeep Kumar Jayaraman, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19037
Pdf URL: https://arxiv.org/pdf/2411.19037
Copy Paste: [[2411.19037]] 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes(https://arxiv.org/abs/2411.19037)
Keywords: diffusion, generative
Abstract: Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on ``next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional shape generation, class-conditioned and also text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the ``next higher-resolution token map" in an autoregressive manner. By redefining 3D AR generation task as ``next-scale" prediction, we reduce the computational cost of generation compared to traditional ``next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its benefit by quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.

Title: I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

Authors: Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19050
Pdf URL: https://arxiv.org/pdf/2411.19050
Copy Paste: [[2411.19050]] I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting(https://arxiv.org/abs/2411.19050)
Keywords: diffusion
Abstract: Inpainting focuses on filling missing or corrupted regions of an image to blend seamlessly with its surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at this https URL.

Title: LADDER: Multi-objective Backdoor Attack via Evolutionary Algorithm

Authors: Dazhuang Liu, Yanqi Qiao, Rui Wang, Kaitai Liang, Georgios Smaragdakis
Subjects: cs.CR, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2411.19075
Pdf URL: https://arxiv.org/pdf/2411.19075
Copy Paste: [[2411.19075]] LADDER: Multi-objective Backdoor Attack via Evolutionary Algorithm(https://arxiv.org/abs/2411.19075)
Keywords: anomaly
Abstract: Current black-box backdoor attacks in convolutional neural networks formulate attack objective(s) as single-objective optimization problems in single domain. Designing triggers in single domain harms semantics and trigger robustness as well as introduces visual and spectral anomaly. This work proposes a multi-objective black-box backdoor attack in dual domains via evolutionary algorithm (LADDER), the first instance of achieving multiple attack objectives simultaneously by optimizing triggers without requiring prior knowledge about victim model. In particular, we formulate LADDER as a multi-objective optimization problem (MOP) and solve it via multi-objective evolutionary algorithm (MOEA). MOEA maintains a population of triggers with trade-offs among attack objectives and uses non-dominated sort to drive triggers toward optimal solutions. We further apply preference-based selection to MOEA to exclude impractical triggers. We state that LADDER investigates a new dual-domain perspective for trigger stealthiness by minimizing the anomaly between clean and poisoned samples in the spectral domain. Lastly, the robustness against preprocessing operations is achieved by pushing triggers to low-frequency regions. Extensive experiments comprehensively showcase that LADDER achieves attack effectiveness of at least 99%, attack robustness with 90.23% (50.09% higher than state-of-the-art attacks on average), superior natural stealthiness (1.12x to 196.74x improvement) and excellent spectral stealthiness (8.45x enhancement) as compared to current stealthy attacks by the average $l_2$-norm across 5 public datasets.

Title: ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos

Authors: Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, Luc Van Gool
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19083
Pdf URL: https://arxiv.org/pdf/2411.19083
Copy Paste: [[2411.19083]] ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos(https://arxiv.org/abs/2411.19083)
Keywords: self-supervised
Abstract: In this paper, we focus on the Ego-Exo Object Correspondence task, an emerging challenge in the field of computer vision that aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task, featuring two new modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse effectively fuses language and visual conditions to enhance target object localization, while XObjAlign enforces consistency in object representations across views through a self-supervised alignment strategy. Extensive experiments demonstrate the effectiveness of ObjectRelator, achieving state-of-the-art performance on Ego2Exo and Exo2Ego tasks with minimal additional parameters. This work provides a foundation for future research in comprehensive cross-view object relation understanding highlighting the potential of leveraging multimodal guidance and cross-view alignment. Codes and models will be released to advance further research in this direction.

Title: Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Authors: Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, Fang Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19108
Pdf URL: https://arxiv.org/pdf/2411.19108
Copy Paste: [[2411.19108]] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model(https://arxiv.org/abs/2411.19108)
Keywords: diffusion
Abstract: As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the modeloutputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure their differences better approximating those of model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.

Title: Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models

Authors: Chung-Ting Tsai, Ching-Yun Ko, I-Hsin Chung, Yu-Chiang Frank Wang, Pin-Yu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19117
Pdf URL: https://arxiv.org/pdf/2411.19117
Copy Paste: [[2411.19117]] Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models(https://arxiv.org/abs/2411.19117)
Keywords: self-supervised, foundation model, generative
Abstract: The rapid advancement of generative models has introduced serious risks, including deepfake techniques for facial synthesis and editing. Traditional approaches rely on training classifiers and enhancing generalizability through various feature extraction techniques. Meanwhile, training-free detection methods address issues like limited data and overfitting by directly leveraging statistical properties from vision foundation models to distinguish between real and fake images. The current leading training-free approach, RIGID, utilizes DINOv2 sensitivity to perturbations in image space for detecting fake images, with fake image embeddings exhibiting greater sensitivity than those of real images. This observation prompts us to investigate how detection performance varies across model backbones, perturbation types, and datasets. Our experiments reveal that detection performance is closely linked to model robustness, with self-supervised (SSL) models providing more reliable representations. While Gaussian noise effectively detects general objects, it performs worse on facial images, whereas Gaussian blur is more effective due to potential frequency artifacts. To further improve detection, we introduce Contrastive Blur, which enhances performance on facial images, and MINDER (MINimum distance DetEctoR), which addresses noise type bias, balancing performance across domains. Beyond performance gains, our work offers valuable insights for both the generative and detection communities, contributing to a deeper understanding of model robustness property utilized for deepfake detection.

Title: MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation

Authors: Daewon Yoon, Hyungsuk Lee, Wonsik Shin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19121
Pdf URL: https://arxiv.org/pdf/2411.19121
Copy Paste: [[2411.19121]] MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation(https://arxiv.org/abs/2411.19121)
Keywords: diffusion
Abstract: This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Additionally, in video generation, unlike single images, the movement of characters across frames introduces potential issues like distortion or unintended changes, which must be effectively evaluated and corrected. In the context of probabilistic models like diffusion, generating the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities. This approach allows for the generation of high-quality multi-scene videos by selecting the best outcomes based on automated scoring rather than manual inspection.

Title: SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Authors: Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19182
Pdf URL: https://arxiv.org/pdf/2411.19182
Copy Paste: [[2411.19182]] SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation(https://arxiv.org/abs/2411.19182)
Keywords: diffusion, generative
Abstract: Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.

Title: Video Depth without Video Models

Authors: Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, Konrad Schindler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19189
Pdf URL: https://arxiv.org/pdf/2411.19189
Copy Paste: [[2411.19189]] Video Depth without Video Models(https://arxiv.org/abs/2411.19189)
Keywords: diffusion, foundation model
Abstract: Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations; including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets. (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: this http URL.

Title: Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection

Authors: Tsun-Hin Cheung, Ka-Chun Fung, Songjiang Lai, Kwan-Ho Lin, Vincent Ng, Kin-Man Lam
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2411.19220
Pdf URL: https://arxiv.org/pdf/2411.19220
Copy Paste: [[2411.19220]] Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection(https://arxiv.org/abs/2411.19220)
Keywords: foundation model, anomaly
Abstract: Identifying defects and anomalies in industrial products is a critical quality control task. Traditional manual inspection methods are slow, subjective, and error-prone. In this work, we propose a novel zero-shot training-free approach for automated industrial image anomaly detection using a multimodal machine learning pipeline, consisting of three foundation models. Our method first uses a large language model, i.e., GPT-3. generate text prompts describing the expected appearances of normal and abnormal products. We then use a grounding object detection model, called Grounding DINO, to locate the product in the image. Finally, we compare the cropped product image patches to the generated prompts using a zero-shot image-text matching model, called CLIP, to identify any anomalies. Our experiments on two datasets of industrial product images, namely MVTec-AD and VisA, demonstrate the effectiveness of this method, achieving high accuracy in detecting various types of defects and anomalies without the need for model training. Our proposed model enables efficient, scalable, and objective quality control in industrial manufacturing settings.

Title: Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Authors: Xinxu Wei, Kanhao Zhao, Yong Jiao, Nancy B. Carlisle, Hua Xie, Yu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19230
Pdf URL: https://arxiv.org/pdf/2411.19230
Copy Paste: [[2411.19230]] Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG(https://arxiv.org/abs/2411.19230)
Keywords: self-supervised, generative
Abstract: Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this by framing it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled/labeled and high/low-density EEG data. To fully leverage the abundant unlabeled EEG data, we introduce a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates Graph Contrastive Pre-training and Graph Masked Autoencoder Pre-training. This approach synergistically combines contrastive and generative pre-training techniques by reconstructing contrastive samples and contrasting the reconstructions. For knowledge distillation from high-density to low-density EEG data, we propose a Graph Topology Distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data, effectively handling missing electrodes through contrastive distillation. To integrate transfer learning and distillation, we jointly pre-train the teacher and student models by contrasting their queries and keys during pre-training, enabling robust distillers for downstream tasks. We demonstrate the effectiveness of our method on four classification tasks across two clinical EEG datasets with abundant unlabeled data and limited labeled data. The experimental results show that our approach significantly outperforms contemporary methods in both efficiency and accuracy.

Title: Z-STAR+: A Zero-shot Style Transfer Method via Adjusting Style Distribution

Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19231
Pdf URL: https://arxiv.org/pdf/2411.19231
Copy Paste: [[2411.19231]] Z-STAR+: A Zero-shot Style Transfer Method via Adjusting Style Distribution(https://arxiv.org/abs/2411.19231)
Keywords: diffusion, generative
Abstract: Style transfer presents a significant challenge, primarily centered on identifying an appropriate style representation. Conventional methods employ style loss, derived from second-order statistics or contrastive learning, to constrain style representation in the stylized result. However, these pre-defined style representations often limit stylistic expression, leading to artifacts. In contrast to existing approaches, we have discovered that latent features in vanilla diffusion models inherently contain natural style and content distributions. This allows for direct extraction of style information and seamless integration of generative priors into the content image without necessitating retraining. Our method adopts dual denoising paths to represent content and style references in latent space, subsequently guiding the content image denoising process with style latent codes. We introduce a Cross-attention Reweighting module that utilizes local content features to query style image information best suited to the input patch, thereby aligning the style distribution of the stylized results with that of the style image. Furthermore, we design a scaled adaptive instance normalization to mitigate inconsistencies in color distribution between style and stylized images on a global scale. Through theoretical analysis and extensive experimentation, we demonstrate the effectiveness and superiority of our diffusion-based \uline{z}ero-shot \uline{s}tyle \uline{t}ransfer via \uline{a}djusting style dist\uline{r}ibution, termed Z-STAR+.

Title: Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes

Authors: Thomas Wimmer, Michael Oechsle, Michael Niemeyer, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19233
Pdf URL: https://arxiv.org/pdf/2411.19233
Copy Paste: [[2411.19233]] Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes(https://arxiv.org/abs/2411.19233)
Keywords: diffusion, generative
Abstract: State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack "liveliness," a key component for creating engaging 3D experiences. Recently, novel video diffusion models generate realistic videos with complex motion and enable animations of 2D images, however they cannot naively be used to animate 3D scenes as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation, or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.

Title: SmartLLMSentry: A Comprehensive LLM Based Smart Contract Vulnerability Detection Framework

Authors: Oualid Zaazaa, Hanan El Bakkali
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19234
Pdf URL: https://arxiv.org/pdf/2411.19234
Copy Paste: [[2411.19234]] SmartLLMSentry: A Comprehensive LLM Based Smart Contract Vulnerability Detection Framework(https://arxiv.org/abs/2411.19234)
Keywords: in-context
Abstract: Smart contracts are essential for managing digital assets in blockchain networks, highlighting the need for effective security measures. This paper introduces SmartLLMSentry, a novel framework that leverages large language models (LLMs), specifically ChatGPT with in-context training, to advance smart contract vulnerability detection. Traditional rule-based frameworks have limitations in integrating new detection rules efficiently. In contrast, SmartLLMSentry utilizes LLMs to streamline this process. We created a specialized dataset of five randomly selected vulnerabilities for model training and evaluation. Our results show an exact match accuracy of 91.1% with sufficient data, although GPT-4 demonstrated reduced performance compared to GPT-3 in rule generation. This study illustrates that SmartLLMSentry significantly enhances the speed and accuracy of vulnerability detection through LLMdriven rule integration, offering a new approach to improving Blockchain security and addressing previously underexplored vulnerabilities in smart contracts.

Title: Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation

Authors: Xuehao Cui, Guangyang Wu, Zhenghao Gan, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19246
Pdf URL: https://arxiv.org/pdf/2411.19246
Copy Paste: [[2411.19246]] Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation(https://arxiv.org/abs/2411.19246)
Keywords: diffusion
Abstract: Existing methods to generate aesthetic QR codes, such as image and style transfer techniques, tend to compromise either the visual appeal or the scannability of QR codes when they incorporate human face identity. Addressing these imperfections, we present Face2QR-a novel pipeline specifically designed for generating personalized QR codes that harmoniously blend aesthetics, face identity, and scannability. Our pipeline introduces three innovative components. First, the ID-refined QR integration (IDQR) seamlessly intertwines the background styling with face ID, utilizing a unified Stable Diffusion (SD)-based framework with control networks. Second, the ID-aware QR ReShuffle (IDRS) effectively rectifies the conflicts between face IDs and QR patterns, rearranging QR modules to maintain the integrity of facial features without compromising scannability. Lastly, the ID-preserved Scannability Enhancement (IDSE) markedly boosts scanning robustness through latent code optimization, striking a delicate balance between face ID, aesthetic quality and QR functionality. In comprehensive experiments, Face2QR demonstrates remarkable performance, outperforming existing approaches, particularly in preserving facial recognition features within custom QR code designs. Codes are available at $\href{this https URL}{\text{this URL link}}$.

Title: Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention

Authors: Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, Huan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19261
Pdf URL: https://arxiv.org/pdf/2411.19261
Copy Paste: [[2411.19261]] Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention(https://arxiv.org/abs/2411.19261)
Keywords: diffusion
Abstract: Training-free diffusion models have achieved remarkable progress in generating multi-subject consistent images within open-domain scenarios. The key idea of these methods is to incorporate reference subject information within the attention layer. However, existing methods still obtain suboptimal performance when handling numerous subjects. This paper reveals the two primary issues contributing to this deficiency. Firstly, there is undesired interference among different subjects within the target image. Secondly, tokens tend to reference nearby tokens, which reduces the effectiveness of the attention mechanism when there is a significant positional difference between subjects in reference and target images. To address these challenges, we propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Specifically, Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating the subject fusion. On the other hand, Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images. This ensures that subjects in the target image can better reference those in the reference image, thereby maintaining better consistency. Extensive experiments demonstrate that the proposed methods significantly enhance multi-subject consistency, outperforming all existing methods in open-domain scenarios.

Title: GMS-VINS:Multi-category Dynamic Objects Semantic Segmentation for Enhanced Visual-Inertial Odometry Using a Promptable Foundation Model

Authors: Rui Zhou, Jingbin Liu, Junbin Xie, Jianyu Zhang, Yingze Hu, Jiele Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19289
Pdf URL: https://arxiv.org/pdf/2411.19289
Copy Paste: [[2411.19289]] GMS-VINS:Multi-category Dynamic Objects Semantic Segmentation for Enhanced Visual-Inertial Odometry Using a Promptable Foundation Model(https://arxiv.org/abs/2411.19289)
Keywords: foundation model
Abstract: Visual-inertial odometry (VIO) is widely used in various fields, such as robots, drones, and autonomous vehicles, due to its low cost and complementary sensors. Most VIO methods presuppose that observed objects are static and time-invariant. However, real-world scenes often feature dynamic objects, compromising the accuracy of pose estimation. These moving entities include cars, trucks, buses, motorcycles, and pedestrians. The diversity and partial occlusion of these objects present a tough challenge for existing dynamic object removal techniques. To tackle this challenge, we introduce GMS-VINS, which integrates an enhanced SORT algorithm along with a robust multi-category segmentation framework into VIO, thereby improving pose estimation accuracy in environments with diverse dynamic objects and frequent occlusions. Leveraging the promptable foundation model, our solution efficiently tracks and segments a wide range of object categories. The enhanced SORT algorithm significantly improves the reliability of tracking multiple dynamic objects, especially in urban settings with partial occlusions or swift movements. We evaluated our proposed method using multiple public datasets representing various scenes, as well as in a real-world scenario involving diverse dynamic objects. The experimental results demonstrate that our proposed method performs impressively in multiple scenarios, outperforming other state-of-the-art methods. This highlights its remarkable generalization and adaptability in diverse dynamic environments, showcasing its potential to handle various dynamic objects in practical applications.

Title: Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation

Authors: Son Thai Ly, Hien V. Nguyen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19297
Pdf URL: https://arxiv.org/pdf/2411.19297
Copy Paste: [[2411.19297]] Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation(https://arxiv.org/abs/2411.19297)
Keywords: self-supervised, foundation model
Abstract: Adapting vision transformer foundation models through parameter-efficient fine-tuning (PEFT) methods has become increasingly popular. These methods optimize a limited subset of parameters, enabling efficient adaptation without the need to fine-tune the entire model while still achieving competitive performance. However, traditional PEFT methods may limit the model's capacity to capture complex patterns, especially those associated with high-frequency spectra. This limitation becomes particularly problematic as existing research indicates that high-frequency features are crucial for distinguishing subtle image structures. To address this issue, we introduce FreqFit, a novel Frequency Fine-tuning module between ViT blocks to enhance model adaptability. FreqFit is simple yet surprisingly effective, and can be integrated with all existing PEFT methods to boost their performance. By manipulating features in the frequency domain, our approach allows models to capture subtle patterns more effectively. Extensive experiments on 24 datasets, using both supervised and self-supervised foundational models with various state-of-the-art PEFT methods, reveal that FreqFit consistently improves performance over the original PEFT methods with performance gains ranging from 1% to 16%. For instance, FreqFit-LoRA surpasses the performances of state-of-the-art baselines on CIFAR100 by more than 10% even without applying regularization or strong augmentation. For reproducibility purposes, the source code is available at this https URL.

Title: Trajectory Attention for Fine-grained Video Motion Control

Authors: Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19324
Pdf URL: https://arxiv.org/pdf/2411.19324
Copy Paste: [[2411.19324]] Trajectory Attention for Fine-grained Video Motion Control(https://arxiv.org/abs/2411.19324)
Keywords: diffusion
Abstract: Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when the trajectory is only partially available. Experiments on camera motion control for images and videos demonstrate significant improvements in precision and long-range consistency while maintaining high-quality generation. Furthermore, we show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing, where it excels in maintaining content consistency over large spatial and temporal ranges.

Title: Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.19331
Pdf URL: https://arxiv.org/pdf/2411.19331
Copy Paste: [[2411.19331]] Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation(https://arxiv.org/abs/2411.19331)
Keywords: self-supervised
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: this https URL.

Title: Towards a Mechanistic Explanation of Diffusion Model Generalization

Authors: Matthew Niedoba, Berend Zwartsenberg, Kevin Murphy, Frank Wood
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19339
Pdf URL: https://arxiv.org/pdf/2411.19339
Copy Paste: [[2411.19339]] Towards a Mechanistic Explanation of Diffusion Model Generalization(https://arxiv.org/abs/2411.19339)
Keywords: diffusion
Abstract: We propose a mechanism for diffusion generalization based on local denoising operations. Through analysis of network and empirical denoisers, we identify local inductive biases in diffusion models. We demonstrate that local denoising operations can be used to approximate the optimal diffusion denoiser. Using a collection of patch-based, local empirical denoisers, we construct a denoiser which approximates the generalization behaviour of diffusion model denoisers over forward and reverse diffusion processes.

Title: CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

Authors: Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19346
Pdf URL: https://arxiv.org/pdf/2411.19346
Copy Paste: [[2411.19346]] CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections(https://arxiv.org/abs/2411.19346)
Keywords: self-supervised, foundation model
Abstract: In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL models require an additional supervised linear probing step, which relies on fully labeled data which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to largely enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification compared to CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFter across 11 diverse image classification datasets.

Title: DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities

Authors: Hui Dai, Dan Pechi, Xinyi Yang, Garvit Banga, Raghav Mantri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19360
Pdf URL: https://arxiv.org/pdf/2411.19360
Copy Paste: [[2411.19360]] DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities(https://arxiv.org/abs/2411.19360)
Keywords: in-context
Abstract: The Needle-in-a-haystack (NIAH) test is a general task used to assess language models' (LMs') abilities to recall particular information from long input context. This framework however does not provide a means of analyzing what factors, beyond context length, contribute to LMs' abilities or inabilities to separate and recall needles from their haystacks. To provide a systematic means of assessing what features contribute to LMs' NIAH capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented Evaluation of NIAH for LLM's). Our work expands on previous NIAH studies by ablating NIAH features beyond typical context length including data type, size, and patterns. We find stark differences between GPT-3.5 and LLaMA 2-7B's performance on DENIAHL, and drops in recall performance when features like item size are increased, and to some degree when data type is changed from numbers to letters. This has implications for increasingly large context models, demonstrating factors beyond item-number impact NIAH capabilities.

Title: Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

Authors: Gaurav Rai, Ojaswa Sharma
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19381
Pdf URL: https://arxiv.org/pdf/2411.19381
Copy Paste: [[2411.19381]] Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints(https://arxiv.org/abs/2411.19381)
Keywords: diffusion
Abstract: Animating hand-drawn sketches using traditional tools is challenging and complex. Sketches provide a visual basis for explanations, and animating these sketches offers an experience of real-time scenarios. We propose an approach for animating a given input sketch based on a descriptive text prompt. Our method utilizes a parametric representation of the sketch's strokes. Unlike previous methods, which struggle to estimate smooth and accurate motion and often fail to preserve the sketch's topology, we leverage a pre-trained text-to-video diffusion model with SDS loss to guide the motion of the sketch's strokes. We introduce length-area (LA) regularization to ensure temporal consistency by accurately estimating the smooth displacement of control points across the frame sequence. Additionally, to preserve shape and avoid topology changes, we apply a shape-preserving As-Rigid-As-Possible (ARAP) loss to maintain sketch rigidity. Our method surpasses state-of-the-art performance in both quantitative and qualitative evaluations.

Title: DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

Authors: Shwetha Ram, Tal Neiman, Qianli Feng, Andrew Stuart, Son Tran, Trishul Chilimbi
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19390
Pdf URL: https://arxiv.org/pdf/2411.19390
Copy Paste: [[2411.19390]] DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models(https://arxiv.org/abs/2411.19390)
Keywords: diffusion
Abstract: Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off is made between prompt fidelity, subject fidelity and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with low prompt fidelity and diversity but high subject fidelity. This inherent trade-off limits the prompt fidelity, subject fidelity and diversity of generated images. In this work, we propose DreamBlend to combine the prompt fidelity from earlier checkpoints and the subject fidelity from later checkpoints during inference. We perform a cross attention guided image synthesis from a later checkpoint, guided by an image generated by an earlier checkpoint, for the same prompt. This enables generation of images with better subject fidelity, prompt fidelity and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.

Title: AMO Sampler: Enhancing Text Rendering with Overshooting

Authors: Xixi Hu, Keyang Xu, Bo Liu, Qiang Liu, Hongliang Fei
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19415
Pdf URL: https://arxiv.org/pdf/2411.19415
Copy Paste: [[2411.19415]] AMO Sampler: Enhancing Text Rendering with Overshooting(https://arxiv.org/abs/2411.19415)
Keywords: diffusion
Abstract: Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. Sate-of-the-art models like Stable Diffusion 3 (SD3), Flux, and AuraFlow still struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for pretrained rectified flow (RF) models, by alternating between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we propose an Attention Modulated Overshooting sampler (AMO), which adaptively controls the strength of overshooting for each image patch according to their attention score with the text content. AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux without compromising overall image quality or increasing inference cost.

Title: Any-Resolution AI-Generated Image Detection by Spectral Learning

Authors: Dimitrios Karageorgiou, Symeon Papadopoulos, Ioannis Kompatsiaris, Efstratios Gavves
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.19417
Pdf URL: https://arxiv.org/pdf/2411.19417
Copy Paste: [[2411.19417]] Any-Resolution AI-Generated Image Detection by Spectral Learning(https://arxiv.org/abs/2411.19417)
Keywords: self-supervised, generative
Abstract: Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations.

Title: Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Authors: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19458
Pdf URL: https://arxiv.org/pdf/2411.19458
Copy Paste: [[2411.19458]] Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning(https://arxiv.org/abs/2411.19458)
Keywords: foundation model
Abstract: Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their abilities on grasping 3D spatial relationships are still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at this https URL.

Title: Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis

Authors: Ruoqi Wang, Haitao Wang, Qiong Luo
Subjects: cs.CV, astro-ph.GA, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19475
Pdf URL: https://arxiv.org/pdf/2411.19475
Copy Paste: [[2411.19475]] Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis(https://arxiv.org/abs/2411.19475)
Keywords: foundation model
Abstract: Galaxy morphology analysis involves classifying galaxies by their shapes and structures. For this task, directly training domain-specific models on large, annotated astronomical datasets is effective but costly. In contrast, fine-tuning vision foundation models on a smaller set of astronomical images is more resource-efficient but generally results in lower accuracy. To harness the benefits of both approaches and address their shortcomings, we propose GalaxAlign, a novel method that fine-tunes pre-trained foundation models to achieve high accuracy on astronomical tasks. Specifically, our method extends a contrastive learning architecture to align three types of data in fine-tuning: (1) a set of schematic symbols representing galaxy shapes and structures, (2) textual labels of these symbols, and (3) galaxy images. This way, GalaxAlign not only eliminates the need for expensive pretraining but also enhances the effectiveness of fine-tuning. Extensive experiments on galaxy classification and similarity search demonstrate that our method effectively fine-tunes general pre-trained models for astronomical tasks by incorporating domain-specific multi-modal knowledge.

Title: Graph-Enhanced EEG Foundation Model

Authors: Limin Wang, Toyotaro Suzumura, Hiroki Kanezashi
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.19507
Pdf URL: https://arxiv.org/pdf/2411.19507
Copy Paste: [[2411.19507]] Graph-Enhanced EEG Foundation Model(https://arxiv.org/abs/2411.19507)
Keywords: foundation model
Abstract: Electroencephalography (EEG) signals provide critical insights for applications in disease diagnosis and healthcare. However, the scarcity of labeled EEG data poses a significant challenge. Foundation models offer a promising solution by leveraging large-scale unlabeled data through pre-training, enabling strong performance across diverse tasks. While both temporal dynamics and inter-channel relationships are vital for understanding EEG signals, existing EEG foundation models primarily focus on the former, overlooking the latter. To address this limitation, we propose a novel foundation model for EEG that integrates both temporal and inter-channel information. Our architecture combines Graph Neural Networks (GNNs), which effectively capture relational structures, with a masked autoencoder to enable efficient pre-training. We evaluated our approach using three downstream tasks and experimented with various GNN architectures. The results demonstrate that our proposed model, particularly when employing the GCN architecture with optimized configurations, consistently outperformed baseline methods across all tasks. These findings suggest that our model serves as a robust foundation model for EEG analysis.

Title: Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Authors: Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19509
Pdf URL: https://arxiv.org/pdf/2411.19509
Copy Paste: [[2411.19509]] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis(https://arxiv.org/abs/2411.19509)
Keywords: diffusion
Abstract: Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.

Title: DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Authors: Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19527
Pdf URL: https://arxiv.org/pdf/2411.19527
Copy Paste: [[2411.19527]] DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding(https://arxiv.org/abs/2411.19527)
Keywords: generative
Abstract: Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Despite their dominance, discrete quantization methods, such as VQ-VAEs, suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter due to high-dimensional complexity and limited training data. To resolve this "discord" between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in the continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results solidify DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: this https URL.

Title: RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation

Authors: Xianfeng Tan, Yuhan Li, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Ran Lin, Bingbing Ni
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19528
Pdf URL: https://arxiv.org/pdf/2411.19528
Copy Paste: [[2411.19528]] RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation(https://arxiv.org/abs/2411.19528)
Keywords: diffusion, generative
Abstract: Standard clothing asset generation involves creating forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized sampling distributions and precise structural requirements in the generated images. Existing models have limited spatial perception and often exhibit structural hallucinations in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating external knowledge from LLM and databases. RAGDiffusion consists of two core processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a three-level alignment that ensures fidelity in structural, pattern, and decoding components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and detail-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.

Title: QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19534
Pdf URL: https://arxiv.org/pdf/2411.19534
Copy Paste: [[2411.19534]] QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain(https://arxiv.org/abs/2411.19534)
Keywords: generative
Abstract: We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.

Title: Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook

Authors: Florinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru, Nicolae Catalin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19537
Pdf URL: https://arxiv.org/pdf/2411.19537
Copy Paste: [[2411.19537]] Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook(https://arxiv.org/abs/2411.19537)
Keywords: diffusion, generative
Abstract: With the recent advancements in generative modeling, the realism of deepfake content has been increasing at a steady pace, even reaching the point where people often fail to detect manipulated media content online, thus being deceived into various kinds of scams. In this paper, we survey deepfake generation and detection techniques, including the most recent developments in the field, such as diffusion models and Neural Radiance Fields. Our literature review covers all deepfake media types, comprising image, video, audio and multimodal (audio-visual) content. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We further construct a taxonomy of deepfake generation and detection methods, illustrating the important groups of methods and the domains where these methods are applied. Next, we gather datasets used for deepfake detection and provide updated rankings of the best performing deepfake detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfake content generated by unseen deepfake generators. Finally, we propose future directions to obtain robust and powerful deepfake detectors. Our project page and new benchmark are available at this https URL.

Title: KV Shifting Attention Enhances Language Modeling

Authors: Mingyu Xu, Wei Cheng, Bingning Wang, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19574
Pdf URL: https://arxiv.org/pdf/2411.19574
Copy Paste: [[2411.19574]] KV Shifting Attention Enhances Language Modeling(https://arxiv.org/abs/2411.19574)
Keywords: in-context
Abstract: The current large language models are mainly based on decode-only structure transformers, which have great in-context learning (ICL) capabilities. It is generally believed that the important foundation of its ICL capability is the induction heads mechanism, which requires at least two layers attention. In order to more efficiently implement the ability of the model's induction, we revisit the induction heads mechanism and proposed a KV shifting attention. We theoretically prove that the KV shifting attention reducing the model's requirements for the depth and width of the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and language modeling, which lead to better performance or faster convergence from toy models to the pre-training models with more than 10 B parameters.

Title: In-Context Learning with Noisy Labels

Authors: Junyong Kang, Donghyun Son, Hwanjun Song, Buru Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19581
Pdf URL: https://arxiv.org/pdf/2411.19581
Copy Paste: [[2411.19581]] In-Context Learning with Noisy Labels(https://arxiv.org/abs/2411.19581)
Keywords: in-context
Abstract: In-context learning refers to the emerging ability of large language models (LLMs) to perform a target task without additional training, utilizing demonstrations of the task. Recent studies aim to enhance in-context learning performance by selecting more useful demonstrations. However, they overlook the presence of inevitable noisy labels in task demonstrations that arise during the labeling process in the real-world. In this paper, we propose a new task, in-context learning with noisy labels, which aims to solve real-world problems for in-context learning where labels in task demonstrations would be corrupted. Moreover, we propose a new method and baseline methods for the new task, inspired by studies in learning with noisy labels. Through experiments, we demonstrate that our proposed method can serve as a safeguard against performance degradation in in-context learning caused by noisy labels.

Title: LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Authors: Taja Kuzman, Nikola Ljubešić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19638
Pdf URL: https://arxiv.org/pdf/2411.19638
Copy Paste: [[2411.19638]] LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification(https://arxiv.org/abs/2411.19638)
Keywords: generative
Abstract: With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.

Title: Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing

Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19652
Pdf URL: https://arxiv.org/pdf/2411.19652
Copy Paste: [[2411.19652]] Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing(https://arxiv.org/abs/2411.19652)
Keywords: diffusion
Abstract: Text-guided image generation and editing using diffusion models have achieved remarkable advancements. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering simplicity and efficiency. However, existing tuning-free approaches often struggle with balancing fidelity and editing precision. Reconstruction errors in DDIM Inversion are partly attributed to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at this https URL.

Title: TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting

Authors: Bojun Xiong, Jialun Liu, Jiakui Hu, Chenming Wu, Jinbo Wu, Xing Liu, Chen Zhao, Errui Ding, Zhouhui Lian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19654
Pdf URL: https://arxiv.org/pdf/2411.19654
Copy Paste: [[2411.19654]] TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting(https://arxiv.org/abs/2411.19654)
Keywords: diffusion
Abstract: Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. Developing an effective and efficient algorithm that is capable of automatically generating high-quality PBR materials rather than RGB texture for 3D meshes can significantly streamline the 3D content creation. Most existing methods leverage pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and input 3D meshes. This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render the multiview images not only for the albedo map but also for roughness and metallic. Moreover, our model is trained in a regression manner instead of diffusion denoising, capable of generating the PBR material for a 3D mesh in a single feed-forward process. Extensive experiments on publicly available benchmarks demonstrate that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, which exhibit better consistency with the given geometry. Our code and trained models are available at this https URL.

Title: MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications

Authors: Gasser Elazab, Torben Gräber, Michael Unterreiner, Olaf Hellwich
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2411.19717
Pdf URL: https://arxiv.org/pdf/2411.19717
Copy Paste: [[2411.19717]] MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications(https://arxiv.org/abs/2411.19717)
Keywords: self-supervised
Abstract: Self-supervised monocular depth estimation (MDE) has gained popularity for obtaining depth predictions directly from videos. However, these methods often produce scale invariant results, unless additional training signals are provided. Addressing this challenge, we introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera's mounting position, both of which are readily available in modern vehicles. Our approach leverages planar-parallax geometry to reconstruct scene structure. The full pipeline consists of three main networks, a multi-frame network, a singleframe network, and a pose network. The multi-frame network processes sequential frames to estimate the structure of the static scene using planar-parallax geometry and the camera mounting position. Based on this reconstruction, it acts as a teacher, distilling knowledge such as scale information, masked drivable area, metric-scale depth for the static scene, and dynamic object mask to the singleframe network. It also aids the pose network in predicting a metric-scaled relative pose between two subsequent images. Our method achieved state-of-the-art results for the driving benchmark KITTI for metric-scaled depth prediction. Notably, it is one of the first methods to produce self-supervised metric-scaled depth prediction for the challenging Cityscapes dataset, demonstrating its effectiveness and versatility.

Title: JetFormer: An Autoregressive Generative Model of Raw Images and Text

Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19722
Pdf URL: https://arxiv.org/pdf/2411.19722
Copy Paste: [[2411.19722]] JetFormer: An Autoregressive Generative Model of Raw Images and Text(https://arxiv.org/abs/2411.19722)
Keywords: generative
Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds.

Title: Real-Time Anomaly Detection in Video Streams

Authors: Fabien Poirier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19731
Pdf URL: https://arxiv.org/pdf/2411.19731
Copy Paste: [[2411.19731]] Real-Time Anomaly Detection in Video Streams(https://arxiv.org/abs/2411.19731)
Keywords: anomaly
Abstract: This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural networkmodels have been tested, and three of them have been selected. You Only Looks Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neuronal Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.

Title: A Note on Small Percolating Sets on Hypercubes via Generative AI

Authors: Gergely Bérczi, Adam Zsolt Wagner
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2411.19734
Pdf URL: https://arxiv.org/pdf/2411.19734
Copy Paste: [[2411.19734]] A Note on Small Percolating Sets on Hypercubes via Generative AI(https://arxiv.org/abs/2411.19734)
Keywords: generative
Abstract: We apply a generative AI pattern-recognition technique called PatternBoost to study bootstrap percolation on hypercubes. With this, we slightly improve the best existing upper bound for the size of percolating subsets of the hypercube.

Title: HVAC-DPT: A Decision Pretrained Transformer for HVAC Control

Authors: Anaïs Berkes
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2411.19746
Pdf URL: https://arxiv.org/pdf/2411.19746
Copy Paste: [[2411.19746]] HVAC-DPT: A Decision Pretrained Transformer for HVAC Control(https://arxiv.org/abs/2411.19746)
Keywords: in-context
Abstract: Building operations consume approximately 40% of global energy, with Heating, Ventilation, and Air Conditioning (HVAC) systems responsible for up to 50% of this consumption. As HVAC energy demands are expected to rise, optimising system efficiency is crucial for reducing future energy use and mitigating climate change. Existing control strategies lack generalisation and require extensive training and data, limiting their rapid deployment across diverse buildings. This paper introduces HVAC-DPT, a Decision-Pretrained Transformer using in-context Reinforcement Learning (RL) for multi-zone HVAC control. HVAC-DPT frames HVAC control as a sequential prediction task, training a causal transformer on interaction histories generated by diverse RL agents. This approach enables HVAC-DPT to refine its policy in-context, without modifying network parameters, allowing for deployment across different buildings without the need for additional training or data collection. HVAC-DPT reduces energy consumption in unseen buildings by 45% compared to the baseline controller, offering a scalable and effective approach to mitigating the increasing environmental impact of HVAC systems.

Title: Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models

Authors: Kaican Li, Weiyan Xie, Yongxiang Huang, Didan Deng, Lanqing Hong, Zhenguo Li, Ricardo Silva, Nevin L. Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19757
Pdf URL: https://arxiv.org/pdf/2411.19757
Copy Paste: [[2411.19757]] Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models(https://arxiv.org/abs/2411.19757)
Keywords: foundation model
Abstract: Fine-tuning foundation models often compromises their robustness to distribution shifts. To remedy this, most robust fine-tuning methods aim to preserve the pre-trained features. However, not all pre-trained features are robust and those methods are largely indifferent to which ones to preserve. We propose dual risk minimization (DRM), which combines empirical risk minimization with worst-case risk minimization, to better preserve the core features of downstream tasks. In particular, we utilize core-feature descriptions generated by LLMs to induce core-based zero-shot predictions which then serve as proxies to estimate the worst-case risk. DRM balances two crucial aspects of model robustness: expected performance and worst-case performance, establishing a new state of the art on various real-world benchmarks. DRM significantly improves the out-of-distribution performance of CLIP ViT-L/14@336 on ImageNet (75.9 to 77.1), WILDS-iWildCam (47.1 to 51.8), and WILDS-FMoW (50.7 to 53.1); opening up new avenues for robust fine-tuning. Our code is available at this https URL .

Title: Riemannian Denoising Score Matching for Molecular Structure Optimization with Accurate Energy

Authors: Jeheon Woo, Seonghwan Kim, Jun Hyeong Kim, Woo Youn Kim
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2411.19769
Pdf URL: https://arxiv.org/pdf/2411.19769
Copy Paste: [[2411.19769]] Riemannian Denoising Score Matching for Molecular Structure Optimization with Accurate Energy(https://arxiv.org/abs/2411.19769)
Keywords: diffusion
Abstract: This study introduces a modified score matching method aimed at generating molecular structures with high energy accuracy. The denoising process of score matching or diffusion models mirrors molecular structure optimization, where scores act like physical force fields that guide particles toward equilibrium states. To achieve energetically accurate structures, it can be advantageous to have the score closely approximate the gradient of the actual potential energy surface. Unlike conventional methods that simply design the target score based on structural differences in Euclidean space, we propose a Riemannian score matching approach. This method represents molecular structures on a manifold defined by physics-informed internal coordinates to efficiently mimic the energy landscape, and performs noising and denoising within this space. Our method has been evaluated by refining several types of starting structures on the QM9 and GEOM datasets, demonstrating that the proposed Riemannian score matching method significantly improves the accuracy of the generated molecular structures, attaining chemical accuracy. The implications of this study extend to various applications in computational chemistry, offering a robust tool for accurate molecular structure prediction.

Title: MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks

Authors: Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19786
Pdf URL: https://arxiv.org/pdf/2411.19786
Copy Paste: [[2411.19786]] MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks(https://arxiv.org/abs/2411.19786)
Keywords: diffusion, generative
Abstract: Recently, human motion analysis has experienced great improvement due to inspiring generative models such as the denoising diffusion model and large language model. While the existing approaches mainly focus on generating motions with textual descriptions and overlook the reciprocal task. In this paper, we present~\textbf{MoTe}, a unified multi-modal model that could handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe enables us to handle the paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and Moti-on-Text Diffusion Model (MTDM). In particular, MED and TED are trained for extracting latent embeddings, and subsequently reconstructing the motion sequences and textual descriptions from the extracted embeddings, respectively. MTDM, on the other hand, performs an iterative denoising process on the input context to handle diverse tasks. Experimental results on the benchmark datasets demonstrate the superior performance of our proposed method on text-to-motion generation and competitive performance on motion captioning.

Title: INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19799
Pdf URL: https://arxiv.org/pdf/2411.19799
Copy Paste: [[2411.19799]] INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge(https://arxiv.org/abs/2411.19799)
Keywords: generative
Abstract: The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.

Title: DeMo: Decoupled Momentum Optimization

Authors: Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19870
Pdf URL: https://arxiv.org/pdf/2411.19870
Copy Paste: [[2411.19870]] DeMo: Decoupled Momentum Optimization(https://arxiv.org/abs/2411.19870)
Keywords: foundation model
Abstract: Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce {\textbf{De}}coupled {\textbf{Mo}}mentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent and supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large scale foundation models. An open source reference PyTorch implementation is published on GitHub at this https URL

Title: Open source Differentiable ODE Solving Infrastructure

Authors: Rakshit Kr. Singh, Aaron Rock Menezes, Rida Irfan, Bharath Ramsundar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19882
Pdf URL: https://arxiv.org/pdf/2411.19882
Copy Paste: [[2411.19882]] Open source Differentiable ODE Solving Infrastructure(https://arxiv.org/abs/2411.19882)
Keywords: diffusion
Abstract: Ordinary Differential Equations (ODEs) are widely used in physics, chemistry, and biology to model dynamic systems, including reaction kinetics, population dynamics, and biological processes. In this work, we integrate GPU-accelerated ODE solvers into the open-source DeepChem framework, making these tools easily accessible. These solvers support multiple numerical methods and are fully differentiable, enabling easy integration into more complex differentiable programs. We demonstrate the capabilities of our implementation through experiments on Lotka-Volterra predator-prey dynamics, pharmacokinetic compartment models, neural ODEs, and solving PDEs using reaction-diffusion equations. Our solvers achieved high accuracy with mean squared errors ranging from $10^{-4}$ to $10^{-6}$ and showed scalability in solving large systems with up to 100 compartments.

Title: FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For Anomaly Segmentation

Authors: Chang Won Lee, Selina Leveugle, Svetlana Stolpner, Chris Langley, Paul Grouchy, Jonathan Kelly, Steven L. Waslander
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19888
Pdf URL: https://arxiv.org/pdf/2411.19888
Copy Paste: [[2411.19888]] FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For Anomaly Segmentation(https://arxiv.org/abs/2411.19888)
Keywords: self-supervised, foundation model, anomaly
Abstract: Anomaly segmentation is a valuable computer vision task for safety-critical applications that need to be aware of unexpected events. Current state-of-the-art (SOTA) scene-level anomaly segmentation approaches rely on diverse inlier class labels during training, limiting their ability to leverage vast unlabeled datasets and pre-trained vision encoders. These methods may underperform in domains with reduced color diversity and limited object classes. Conversely, existing unsupervised methods struggle with anomaly segmentation with the diverse scenes of less restricted domains. To address these challenges, we introduce FlowCLAS, a novel self-supervised framework that utilizes vision foundation models to extract rich features and employs a normalizing flow network to learn their density distribution. We enhance the model's discriminative power by incorporating Outlier Exposure and contrastive learning in the latent space. FlowCLAS significantly outperforms all existing methods on the ALLO anomaly segmentation benchmark for space robotics and demonstrates competitive results on multiple road anomaly segmentation benchmarks for autonomous driving, including Fishyscapes Lost&Found and Road Anomaly. These results highlight FlowCLAS's effectiveness in addressing the unique challenges of space anomaly segmentation while retaining SOTA performance in the autonomous driving domain without reliance on inlier segmentation labels.

Title: $C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields

Authors: Prajwal Singh, Ashish Tiwari, Gautam Vashishtha, Shanmuganathan Raman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19903
Pdf URL: https://arxiv.org/pdf/2411.19903
Copy Paste: [[2411.19903]] $C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields(https://arxiv.org/abs/2411.19903)
Keywords: generative
Abstract: Neural radiance fields (NeRF) have exhibited highly photorealistic rendering of novel views through per-scene optimization over a single 3D scene. With the growing popularity of NeRF and its variants, they have become ubiquitous and have been identified as efficient 3D resources. However, they are still far from being scalable since a separate model needs to be stored for each scene, and the training time increases linearly with every newly added scene. Surprisingly, the idea of encoding multiple 3D scenes into a single NeRF model is heavily under-explored. In this work, we propose a novel conditional-cum-continual framework, called $C^{3}$-NeRF, to accommodate multiple scenes into the parameters of a single neural radiance field. Unlike conventional approaches that leverage feature extractors and pre-trained priors for scene conditioning, we use simple pseudo-scene labels to model multiple scenes in NeRF. Interestingly, we observe the framework is also inherently continual (via generative replay) with minimal, if not no, forgetting of the previously learned scenes. Consequently, the proposed framework adapts to multiple new scenes without necessarily accessing the old data. Through extensive qualitative and quantitative evaluation using synthetic and real datasets, we demonstrate the inherent capacity of the NeRF model to accommodate multiple scenes with high-quality novel-view renderings without adding additional parameters. We provide implementation details and dynamic visualizations of our results in the supplementary file.