2024-12-20

Title: Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing

Authors: Le-Anh Tran, Dong-Chul Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14220
Pdf URL: https://arxiv.org/pdf/2412.14220
Copy Paste: [[2412.14220]] Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing(https://arxiv.org/abs/2412.14220)
Keywords: generative
Abstract: This paper proposes a lightweight neural network designed for realistic image dehazing, utilizing a Distilled Pooling Transformer Encoder, named DPTE-Net. Recently, while vision transformers (ViTs) have achieved great success in various vision tasks, their self-attention (SA) module's complexity scales quadratically with image resolution, hindering their applicability on resource-constrained devices. To overcome this, the proposed DPTE-Net substitutes traditional SA modules with efficient pooling mechanisms, significantly reducing computational demands while preserving ViTs' learning capabilities. To further enhance semantic feature learning, a distillation-based training process is implemented which transfers rich knowledge from a larger teacher network to DPTE-Net. Additionally, DPTE-Net is trained within a generative adversarial network (GAN) framework, leveraging the strong generalization of GAN in image restoration, and employs a transmission-aware loss function to dynamically adapt to varying haze densities. Experimental results on various benchmark datasets have shown that the proposed DPTE-Net can achieve competitive dehazing performance when compared to state-of-the-art methods while maintaining low computational complexity, making it a promising solution for resource-limited applications. The code of this work is available at this https URL.
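
The contrast between self-attention and pooling as a token mixer can be illustrated with a minimal sketch. It assumes a 1D token sequence and a plain average-pooling window; the function name `pool_mixer` and the `window` parameter are hypothetical, and this is not the DPTE-Net implementation. Each token is replaced by the mean of its local neighbourhood, so the cost grows linearly with the number of tokens rather than quadratically.

```python
def pool_mixer(tokens, window=3):
    """Mix each token with its neighbours by average pooling (no attention).

    tokens: list of equal-length feature vectors (lists of floats).
    window: odd neighbourhood size; an illustrative assumption.
    """
    n = len(tokens)
    half = window // 2
    mixed = []
    for i in range(n):
        # Clip the window at the sequence borders instead of padding.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neighbours = tokens[lo:hi]
        dim = len(tokens[i])
        mixed.append(
            [sum(t[d] for t in neighbours) / len(neighbours) for d in range(dim)]
        )
    return mixed


if __name__ == "__main__":
    print(pool_mixer([[1.0], [2.0], [3.0]]))
```

Each output token touches at most `window` inputs, whereas self-attention compares every token with every other token.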

Title: Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data

Authors: Shaina Raza, Drai Paulen-Patterson, Chen Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14276
Pdf URL: https://arxiv.org/pdf/2412.14276
Copy Paste: [[2412.14276]] Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data(https://arxiv.org/abs/2412.14276)
Keywords: generative
Abstract: Fake news poses a significant threat to public opinion and social stability in modern society. This study presents a comparative evaluation of BERT-like encoder-only models and autoregressive decoder-only large language models (LLMs) for fake news detection. We introduce a dataset of news articles labeled with GPT-4 assistance (an AI-labeling method) and verified by human experts to ensure reliability. Both BERT-like encoder-only models and LLMs were fine-tuned on this dataset. Additionally, we developed an instruction-tuned LLM approach with majority voting during inference for label generation. Our analysis reveals that BERT-like models generally outperform LLMs in classification tasks, while LLMs demonstrate superior robustness against text perturbations. Compared with weakly labeled (distant supervision) data, the results show that AI labels with human supervision achieve better classification results. This study highlights the effectiveness of combining AI-based annotation with human oversight and demonstrates the performance of different families of machine learning models for fake news detection.
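
Majority voting over several sampled LLM outputs can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `majority_vote`, the label strings, and the tie-breaking rule (first label seen wins) are all assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among several sampled predictions.

    Ties go to the label that appeared first, because Counter.most_common
    keeps insertion order among equal counts.
    """
    if not labels:
        raise ValueError("need at least one prediction to vote on")
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical outputs from three sampled generations for one article.
    print(majority_vote(["fake", "real", "fake"]))
```

An odd number of samples avoids ties when the task is binary.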

Title: PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation

Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, Di Niu
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.14283
Pdf URL: https://arxiv.org/pdf/2412.14283
Copy Paste: [[2412.14283]] PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation(https://arxiv.org/abs/2412.14283)
Keywords: diffusion
Abstract: Recent research explores the potential of Diffusion Models (DMs) for consistent object editing, which aims to modify object position, size, and composition, etc., while preserving the consistency of objects and background without changing their texture and attributes. Current inference-time methods often rely on DDIM inversion, which inherently compromises efficiency and the achievable consistency of edited images. Recent methods also utilize energy guidance which iteratively updates the predicted noise and can drive the latents away from the original image, resulting in distortions. In this paper, we propose PixelMan, an inversion-free and training-free method for achieving consistent object editing via Pixel Manipulation and generation, where we directly create a duplicate copy of the source object at target location in the pixel space, and introduce an efficient sampling approach to iteratively harmonize the manipulated object into the target location and inpaint its original location, while ensuring image consistency by anchoring the edited image to be generated to the pixel-manipulated image as well as by introducing various consistency-preserving optimization techniques during inference. Experimental evaluations based on benchmark datasets as well as extensive visual comparisons show that in as few as 16 inference steps, PixelMan outperforms a range of state-of-the-art training-based and training-free methods (usually requiring 50 steps) on multiple consistent object editing tasks.

Title: TRecViT: A Recurrent Video Transformer

Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14294
Pdf URL: https://arxiv.org/pdf/2412.14294
Copy Paste: [[2412.14294]] TRecViT: A Recurrent Video Transformer(https://arxiv.org/abs/2412.14294)
Keywords: self-supervised
Abstract: We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at this https URL.
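
The idea of causal mixing over time with a gated linear recurrence can be sketched for a single scalar channel. This is a generic form assumed for illustration only; the exact LRU parameterisation used in TRecViT is not given in the abstract, and `gate_weight` and `gate_bias` are hypothetical parameters.

```python
import math


def gated_linear_recurrence(xs, gate_weight=1.0, gate_bias=0.0):
    """Causal temporal mixing: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The gate a_t = sigmoid(gate_weight * x_t + gate_bias) depends only on
    the current input, so the state update is linear in h_{t-1}.
    """
    h = 0.0
    outputs = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-(gate_weight * x + gate_bias)))
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    # With gate_weight=0 the gate is fixed at 0.5 (a running blend).
    print(gated_linear_recurrence([2.0, 4.0], gate_weight=0.0))
```

Because each output depends only on current and past inputs, appending future frames never changes earlier outputs, which is what makes the mixing causal.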

Title: Personalized Generative Low-light Image Denoising and Enhancement

Authors: Xijun Wang, Prateek Chennuri, Yu Yuan, Bole Ma, Xingguang Zhang, Stanley Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14327
Pdf URL: https://arxiv.org/pdf/2412.14327
Copy Paste: [[2412.14327]] Personalized Generative Low-light Image Denoising and Enhancement(https://arxiv.org/abs/2412.14327)
Keywords: diffusion, generative
Abstract: While smartphone cameras today can produce astonishingly good photos, their performance in low light is still not completely satisfactory because of the fundamental limits in photon shot noise and sensor read noise. Generative image restoration methods have demonstrated promising results compared to traditional methods, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Recognizing the availability of personalized photo galleries on users' smartphones, we propose Personalized Generative Denoising (PGD) by building a diffusion model customized for different users. Our core innovation is an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer provides a strong prior that can be integrated with the diffusion model to restore the degraded images, without the need of fine-tuning. Over a wide range of low-light testing scenarios, we show that PGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches.

Title: Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

Authors: Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14333
Pdf URL: https://arxiv.org/pdf/2412.14333
Copy Paste: [[2412.14333]] Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters(https://arxiv.org/abs/2412.14333)
Keywords: diffusion
Abstract: Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address the challenges, in this paper, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.

Title: A Unifying Information-theoretic Perspective on Evaluating Generative Models

Authors: Alexis Fox, Samarth Swarup, Abhijin Adiga
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14340
Pdf URL: https://arxiv.org/pdf/2412.14340
Copy Paste: [[2412.14340]] A Unifying Information-theoretic Perspective on Evaluating Generative Models(https://arxiv.org/abs/2412.14340)
Keywords: generative
Abstract: Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.
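
For orientation, one common kNN-based precision/recall construction of the kind the abstract refers to can be sketched as below. This is an assumed illustration of the earlier k-NN ball style of metric, not the PCE/RCE/RE metric proposed in the paper; the function names and the toy points are hypothetical.

```python
import math


def kth_nn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, reference, k):
    """Fraction of queries that fall inside some reference point's k-NN ball."""
    radii = kth_nn_radius(reference, k)
    hits = sum(
        any(math.dist(q, r) <= rad for r, rad in zip(reference, radii))
        for q in queries
    )
    return hits / len(queries)


if __name__ == "__main__":
    real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    generated = [(0.1, 0.1), (5.0, 5.0), (6.0, 5.0)]
    print("precision:", coverage(generated, real, k=1))  # fidelity
    print("recall:", coverage(real, generated, k=1))     # diversity
```

In this toy data a single isolated generated point gets a very large k-NN ball, which inflates recall; this sensitivity to outliers is one example of the behaviour such metrics can show.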

Title: Surrealistic-like Image Generation with Vision-Language Models

Authors: Elif Ayten, Shuai Wang, Hjalmar Snoep
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14366
Pdf URL: https://arxiv.org/pdf/2412.14366
Copy Paste: [[2412.14366]] Surrealistic-like Image Generation with Vision-Language Models(https://arxiv.org/abs/2412.14366)
Keywords: generative
Abstract: Recent advances in generative AI make it convenient to create different types of content, including text, images, and code. In this paper, we explore the generation of images in the style of paintings in the surrealism movement using vision-language generative models, including DALL-E, Deep Dream Generator, and DreamStudio. Our investigation starts with the generation of images under various image generation settings and different models. The primary objective is to identify the most suitable model and settings for producing such images. Additionally, we aim to understand the impact of using edited base images on the generated resulting images. Through these experiments, we evaluate the performance of selected models and gain valuable insights into their capabilities in generating such images. Our analysis shows that DALL-E 2 performs the best when using the prompt generated by ChatGPT.

Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Authors: William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Subjects: cs.CL, eess.SP
Abstract URL: https://arxiv.org/abs/2412.14373
Pdf URL: https://arxiv.org/pdf/2412.14373
Copy Paste: [[2412.14373]] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling(https://arxiv.org/abs/2412.14373)
Keywords: self-supervised, generative
Abstract: Large Language Models (LLMs) have shown remarkable adaptability across domains beyond text, specifically electrocardiograms (ECGs). More specifically, there is a growing body of work exploring the task of generating text from a multi-channeled ECG and corresponding textual prompt. Current approaches typically involve pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective and using the features output by the pretrained encoder to finetune an LLM for natural language generation (NLG). However, these methods are limited by 1) inefficiency from two-stage training and 2) interpretability challenges with encoder-generated features. To address these limitations, we introduce ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. This approach compresses and encodes ECG signals into tokens, enabling end-to-end LLM training by combining ECG and text tokens directly, while being much more interpretable since the ECG tokens can be directly mapped back to the original signal. Using ECG-Byte, we achieve competitive performance in NLG tasks in only half the time and ~48% of the data required by two-stage approaches.
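
The general idea of applying byte pair encoding to a signal can be sketched as follows. This is a toy illustration under stated assumptions, not the ECG-Byte pipeline: the uniform quantizer, the number of levels, and the function names `quantize`, `bpe_merges`, and `decode` are all hypothetical.

```python
from collections import Counter


def quantize(signal, levels=4):
    """Map each sample to an integer symbol in [0, levels - 1]."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0
    return [min(levels - 1, int((x - lo) / span * levels)) for x in signal]


def bpe_merges(symbols, num_merges):
    """Greedily merge the most frequent adjacent token pair, BPE style."""
    seq = [(s,) for s in symbols]  # each token is a tuple of base symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging gains nothing
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges


def decode(tokens):
    """Expand tokens back into the quantized symbol sequence."""
    return [s for tok in tokens for s in tok]


if __name__ == "__main__":
    symbols = quantize([0.0, 1.0, 0.0, 1.0, 0.0, 1.0], levels=2)
    tokens, merges = bpe_merges(symbols, num_merges=1)
    print(tokens, decode(tokens) == symbols)
```

Because every token is just a concatenation of base symbols, decoding recovers the quantized sequence exactly; only the quantization step loses information.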

Title: Enhancing Diffusion Models for High-Quality Image Generation

Authors: Jaineet Shah, Michael Gromis, Rickston Pinto
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14422
Pdf URL: https://arxiv.org/pdf/2412.14422
Copy Paste: [[2412.14422]] Enhancing Diffusion Models for High-Quality Image Generation(https://arxiv.org/abs/2412.14422)
Keywords: diffusion, generative
Abstract: This report presents the comprehensive implementation, evaluation, and optimization of Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs), which are state-of-the-art generative models. During inference, these models take random noise as input and iteratively generate high-quality images as output. The study focuses on enhancing their generative capabilities by incorporating advanced techniques such as Classifier-Free Guidance (CFG), Latent Diffusion Models with Variational Autoencoders (VAE), and alternative noise scheduling strategies. The motivation behind this work is the growing demand for efficient and scalable generative AI models that can produce realistic images across diverse datasets, addressing challenges in applications such as art creation, image synthesis, and data augmentation. Evaluations were conducted on datasets including CIFAR-10 and ImageNet-100, with a focus on improving inference speed, computational efficiency, and image quality metrics like Frechet Inception Distance (FID). Results demonstrate that DDIM + CFG achieves faster inference and superior image quality. Challenges with VAE and noise scheduling are also highlighted, suggesting opportunities for future optimization. This work lays the groundwork for developing scalable, efficient, and high-quality generative AI systems to benefit industries ranging from entertainment to robotics.
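
Classifier-Free Guidance combines a conditional and an unconditional noise prediction at each sampling step. The sketch below shows the standard combination rule on plain lists of floats; the function name `cfg_noise` and the toy values are assumptions, and a real sampler would apply the same rule to model outputs.

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_u + s * (eps_c - eps_u).

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 push further towards the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Toy noise predictions for a two-element "image".
    print(cfg_noise([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0))
```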

Title: FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning

Authors: Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14424
Pdf URL: https://arxiv.org/pdf/2412.14424
Copy Paste: [[2412.14424]] FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning(https://arxiv.org/abs/2412.14424)
Keywords: foundation model
Abstract: Large Vision-Language Models typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these models on end-user devices, such as in medical clinics, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough for fully fine-tuning large VLMs on their own. A naive solution to these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients resulting in suboptimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters in the server and global Adapters in the clients exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layerwise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments utilizing 48 medical image datasets across five different medical vision-language FL task settings encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines.
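
The naive baseline described above, combining client adapter weights with a federated averaging rule, can be sketched as below. This assumes flattened adapter weights and dataset-size weighting (FedAvg style); it does not implement FedPIA's permutation or Wasserstein barycenter integration, and the function name `fedavg` is hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Average adapter weights across clients, weighted by local dataset size.

    client_weights: list of equal-length flat weight vectors, one per client.
    client_sizes:   number of local training examples per client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical clinics with 1 and 3 local examples respectively.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Element-wise averaging assumes the clients' parameters are aligned coordinate by coordinate, which is the assumption that permutation-based methods revisit.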

Title: IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

Authors: Anand Kumar, Jiteng Mu, Nuno Vasconcelos
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14432
Pdf URL: https://arxiv.org/pdf/2412.14432
Copy Paste: [[2412.14432]] IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features(https://arxiv.org/abs/2412.14432)
Keywords: diffusion
Abstract: Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns regarding data privacy and copyright infringement among artists. Consequently, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as introspective style attribution (IntroStyle) and demonstrates superior performance to state-of-the-art models for style retrieval. We also introduce a synthetic dataset of Style Hacks (SHacks) to isolate artistic style and evaluate fine-grained style attribution performance.

Title: GenHMR: Generative Human Mesh Recovery

Authors: Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14444
Pdf URL: https://arxiv.org/pdf/2412.14444
Copy Paste: [[2412.14444]] GenHMR: Generative Human Mesh Recovery(https://arxiv.org/abs/2412.14444)
Keywords: generative
Abstract: Human mesh recovery (HMR) is crucial in many computer vision applications, from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer to convert 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with a randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. Project website can be found at this https URL.

Title: CLDG: Contrastive Learning on Dynamic Graphs

Authors: Yiming Xu, Bin Shi, Teng Ma, Bo Dong, Haoyi Zhou, Qinghua Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14451
Pdf URL: https://arxiv.org/pdf/2412.14451
Copy Paste: [[2412.14451]] CLDG: Contrastive Learning on Dynamic Graphs(https://arxiv.org/abs/2412.14451)
Keywords: self-supervised
Abstract: The graph with complex annotations is the most potent data type, and its constantly evolving nature motivates further exploration of unsupervised dynamic graph representation. One of the representative paradigms is graph contrastive learning. It constructs self-supervised signals by maximizing the mutual information between the static graph's augmentation views. However, the semantics and labels may change within the augmentation process, causing a significant performance drop in downstream tasks. This drawback becomes greatly magnified on dynamic graphs. To address this problem, we design a simple yet effective framework named CLDG. Firstly, we elaborate that dynamic graphs have temporal translation invariance at different levels. Then, we propose a sampling layer to extract the temporally-persistent signals. It encourages the node to maintain consistent local and global representations, i.e., temporal translation invariance under the timespan views. Extensive experiments demonstrate the effectiveness and efficiency of the method on seven datasets by outperforming eight unsupervised state-of-the-art baselines and showing competitiveness against four semi-supervised methods. Compared with the existing dynamic graph method, the number of model parameters and the training time are reduced by an average of 2,001.86 times and 130.31 times on seven datasets, respectively.

Title: Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation

Authors: Shengqi Liu, Yuhao Cheng, Zhuo Chen, Xingyu Ren, Wenhan Zhu, Lincheng Li, Mengxiao Bi, Xiaokang Yang, Yichao Yan
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14453
Pdf URL: https://arxiv.org/pdf/2412.14453
Copy Paste: [[2412.14453]] Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation(https://arxiv.org/abs/2412.14453)
Keywords: diffusion, generative
Abstract: Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and various body adaptability. Our project page: this https URL.

Title: LEDiff: Latent Exposure Diffusion for HDR Generation

Authors: Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14456
Pdf URL: https://arxiv.org/pdf/2412.14456
Copy Paste: [[2412.14456]] LEDiff: Latent Exposure Diffusion for HDR Generation(https://arxiv.org/abs/2412.14456)
Keywords: diffusion, generative
Abstract: While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that enables a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.

Title: From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research

Authors: Xiang Cheng, Raveesh Mayya, João Sedoc
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14461
Pdf URL: https://arxiv.org/pdf/2412.14461
Copy Paste: [[2412.14461]] From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research(https://arxiv.org/abs/2412.14461)
Keywords: generative
Abstract: Unstructured text data annotation and analysis are fundamental to management research, often relying on human annotators through crowdsourcing platforms. While Large Language Models (LLMs) promise to provide a cost-effective and efficient alternative to human annotation, there is no systematic workflow that evaluates when LLMs are suitable or how to proceed with LLM-based text annotation in a reproducible manner. This paper addresses this methodological gap by introducing the "SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection, addressing challenges such as developing robust annotation guidelines, establishing high-quality human baselines, optimizing prompts, and ensuring reproducibility across LLMs. We validate the SILICON workflow through seven case studies covering common management research tasks, including business proposal evaluation, dialog intent and breakdown analysis, and review attribute detection. Our findings highlight the importance of validating annotation guideline agreement, the superiority of expert-developed human baselines over crowdsourced ones, the iterative nature of prompt optimization, and the necessity of testing multiple LLMs. Notably, we propose a regression-based methodology to empirically compare LLM outputs across prompts and models. Our workflow advances management research by establishing reproducible processes for LLM-based annotation that maintain scientific rigor. We provide practical guidance for researchers to effectively navigate the evolving landscape of generative AI tools while maintaining transparency and reproducibility.

Title: Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

Authors: Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14462
Pdf URL: https://arxiv.org/pdf/2412.14462
Copy Paste: [[2412.14462]] Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion(https://arxiv.org/abs/2412.14462)
Keywords: diffusion
Abstract: As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene with various position prompts. To address the limited data issue and incorporate this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively facilitates the notion of affordance. Extensive experimental results show that our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images. Please refer to our code on this https URL.

Title: LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations

Authors: Tung Do, Thuan Hoang Nguyen, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.14464
Pdf URL: https://arxiv.org/pdf/2412.14464
Copy Paste: [[2412.14464]] LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations(https://arxiv.org/abs/2412.14464)
Keywords: diffusion
Abstract: We propose a new view synthesis method that synthesizes a 3D neural field from either single-view or few-view input images. To address the ill-posed nature of the image-to-3D generation problem, we devise a two-stage method that involves a reconstruction model and a diffusion model for view synthesis. Our reconstruction model first lifts one or more input images to the 3D space from a volume as the coarse-scale 3D representation followed by a tri-plane as the fine-scale 3D representation. To mitigate the ambiguity in occluded regions, our diffusion model then hallucinates missing details in the rendered images from tri-planes. We then introduce a new progressive refinement technique that iteratively applies the reconstruction and diffusion model to gradually synthesize novel views, boosting the overall quality of the 3D representations and their rendering. Empirical evaluation demonstrates the superiority of our method over state-of-the-art methods on the synthetic SRN-Car dataset, the in-the-wild CO3D dataset, and the large-scale Objaverse dataset while achieving both sampling efficacy and multi-view consistency.

Title: DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

Authors: Wengyi Zhan, Mingbao Lin, Shuicheng Yan, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14465
Pdf URL: https://arxiv.org/pdf/2412.14465
Copy Paste: [[2412.14465]] DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On(https://arxiv.org/abs/2412.14465)
Keywords: diffusion
Abstract: We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion models. This initial foray into the application of untrained diffusion models in virtual try-on technology potentially paves the way for further exploration and refinement in this industrially and academically valuable field.

Title: Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

Authors: Chuang Lin, Bingbing Zhuang, Shanlin Sun, Ziyu Jiang, Jianfei Cai, Manmohan Chandraker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14494
Pdf URL: https://arxiv.org/pdf/2412.14494
Copy Paste: [[2412.14494]] Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles(https://arxiv.org/abs/2412.14494)
Keywords: diffusion
Abstract: The recent advent of large-scale 3D data, e.g. Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops significantly when applied to real-world images. This paper consolidates a set of good practices to finetune large pretrained models for a real-world task -- harvesting vehicle assets for autonomous driving applications. To this end, we delve into the discrepancies between the synthetic data and real driving data, then develop several strategies to account for them properly. Specifically, we start with a virtual camera rotation of real images to ensure geometric alignment with synthetic data and consistency with the pose manifold defined by pretrained models. We also identify important design choices in object-centric data curation to account for varying object distances in real driving scenes -- learn across varying object scales with fixed camera focal length. Further, we perform occlusion-aware training in latent spaces to account for ubiquitous occlusions in real data, and handle large viewpoint changes by leveraging a symmetric prior. Our insights lead to effective finetuning that results in a $68.8\%$ reduction in FID for novel view synthesis over prior art.

Title: Content-style disentangled representation for controllable artistic image stylization and generation

Authors: Ma Zhuoqi, Zhang Yixuan, You Zejun, Tian Long, Liu Xiyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14496
Pdf URL: https://arxiv.org/pdf/2412.14496
Copy Paste: [[2412.14496]] Content-style disentangled representation for controllable artistic image stylization and generation(https://arxiv.org/abs/2412.14496)
Keywords: diffusion
Abstract: Controllable artistic image stylization and generation aims to render the content provided by text or image with the learned artistic style, where content and style decoupling is the key to achieving satisfactory results. However, current methods for content and style disentanglement primarily rely on image information for supervision, which leads to two problems: 1) models can only support one modality for style or content input; 2) incomplete disentanglement results in semantic interference from the reference image. To address the above issues, this paper proposes a content-style representation disentangling method for controllable artistic image stylization and generation. We construct a WikiStyle+ dataset consisting of artworks with corresponding textual descriptions for style and content. Based on the multimodal dataset, we propose a diffusion model guided by disentangled content and style representations. The disentangled representations are first learned by Q-Formers and then injected into a pre-trained diffusion model using learnable multi-step cross-attention layers for better controllable stylization. This approach allows the model to accommodate inputs from different modalities. Experimental results show that our method achieves a thorough disentanglement of content and style in reference images under multimodal supervision, thereby enabling a harmonious integration of content and style in the generated outputs, successfully producing style-consistent and expressive stylized images.

Title: Guided Diffusion Model for Sensor Data Obfuscation

Authors: Xin Yang, Omid Ardakanian
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14499
Pdf URL: https://arxiv.org/pdf/2412.14499
Copy Paste: [[2412.14499]] Guided Diffusion Model for Sensor Data Obfuscation(https://arxiv.org/abs/2412.14499)
Keywords: diffusion, generative
Abstract: Sensor data collected by Internet of Things (IoT) devices carries detailed information about individuals in their vicinity. Sharing this data with a semi-trusted service provider may compromise the individuals' privacy, as sensitive information can be extracted by powerful machine learning models. Data obfuscation empowered by generative models is a promising approach to generate synthetic sensor data such that the useful information contained in the original data is preserved and the sensitive information is obscured. This newly generated data will then be shared with the service provider instead of the original sensor data. In this work, we propose PrivDiffuser, a novel data obfuscation technique based on a denoising diffusion model that attains a superior trade-off between data utility and privacy through effective guidance techniques. Specifically, we extract latent representations that contain information about public and private attributes from sensor data to guide the diffusion model, and impose mutual information-based regularization when learning the latent representations to alleviate the entanglement of public and private attributes, thereby increasing the effectiveness of guidance. Evaluation on three real-world datasets containing different sensing modalities reveals that PrivDiffuser yields a better privacy-utility trade-off than the state-of-the-art obfuscation model, decreasing the utility loss by up to $1.81\%$ and the privacy loss by up to $3.42\%$. Moreover, we showed that users with diverse privacy needs can use PrivDiffuser to protect their privacy without having to retrain the model.

Title: Efficient Self-Supervised Video Hashing with Selective State Spaces

Authors: Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2412.14518
Pdf URL: https://arxiv.org/pdf/2412.14518
Copy Paste: [[2412.14518]] Efficient Self-Supervised Video Hashing with Selective State Spaces(https://arxiv.org/abs/2412.14518)
Keywords: self-supervised
Abstract: Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at this https URL.

Title: Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Authors: Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14528
Pdf URL: https://arxiv.org/pdf/2412.14528
Copy Paste: [[2412.14528]] Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models(https://arxiv.org/abs/2412.14528)
Keywords: generative
Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.
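
Note (illustrative sketch, not the paper's code): the abstract relies on the Sinkhorn distance as an approximation of the Wasserstein distance. The snippet below shows generic Sinkhorn iterations between two small histograms in plain Python. The function name `sinkhorn_distance`, the regularization strength `eps`, the iteration count, and the toy cost matrix are all assumptions made for illustration; MultiLevelOT's actual cost matrices and logit handling are not reproduced here.

```python
import math


def sinkhorn_distance(a, b, cost, eps=0.5, n_iters=200):
    """Entropy-regularized optimal transport cost between histograms a and b.

    a, b  : lists of non-negative weights, each summing to 1
    cost  : len(a) x len(b) nested list of transport costs
    eps   : entropic regularization strength (assumed value, not from the paper)
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan and its total cost.
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))


if __name__ == "__main__":
    cost = [[0.0, 1.0], [1.0, 0.0]]
    print(sinkhorn_distance([0.9, 0.1], [0.1, 0.9], cost))
    print(sinkhorn_distance([0.9, 0.1], [0.9, 0.1], cost))
```

Smaller `eps` brings the result closer to the unregularized Wasserstein cost but typically needs more iterations to converge.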

Title: Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Authors: Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14531
Pdf URL: https://arxiv.org/pdf/2412.14531
Copy Paste: [[2412.14531]] Consistent Human Image and Video Generation with Spatially Conditioned Diffusion(https://arxiv.org/abs/2412.14531)
Keywords: diffusion
Abstract: Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
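
Note (illustrative sketch, not the paper's code): the causal feature interaction described above, where reference features query only themselves while target features query both reference and target, can be expressed as a boolean attention mask. The token layout (reference tokens first, then target tokens) and the function name `causal_reference_mask` are assumptions made for illustration.

```python
def causal_reference_mask(n_ref, n_tgt):
    """Build a boolean attention mask over concatenated [reference | target] tokens.

    mask[q][k] is True when query token q may attend to key token k.
    Tokens 0..n_ref-1 are reference tokens, the remaining n_tgt are target tokens.
    """
    total = n_ref + n_tgt
    mask = []
    for q in range(total):
        if q < n_ref:
            # Reference queries see reference keys only.
            row = [k < n_ref for k in range(total)]
        else:
            # Target queries see both reference and target keys.
            row = [True] * total
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in causal_reference_mask(2, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

In a self-attention layer, positions where the mask is False would have their attention logits set to negative infinity before the softmax.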

Title: ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting

Authors: Qi Zheng, Zihao Yao, Yaying Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14537
Pdf URL: https://arxiv.org/pdf/2412.14537
Copy Paste: [[2412.14537]] ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting(https://arxiv.org/abs/2412.14537)
Keywords: self-supervised
Abstract: Spatial-temporal forecasting is crucial and widely applicable in various domains such as traffic, energy, and climate. Benefiting from the abundance of unlabeled spatial-temporal data, self-supervised methods are increasingly adapted to learn spatial-temporal representations. However, they encounter three key challenges: 1) the difficulty in selecting reliable negative pairs due to the homogeneity of variables, hindering contrastive learning methods; 2) overlooking spatial correlations across variables over time; 3) limitations of efficiency and scalability in existing self-supervised learning methods. To tackle these, we propose a lightweight representation-learning model, ST-ReP, which integrates current value reconstruction and future value prediction into the pre-training framework for spatial-temporal forecasting. We also design a new spatial-temporal encoder to model fine-grained relationships. Moreover, multi-time scale analysis is incorporated into the self-supervised loss to enhance predictive capability. Experimental results across diverse domains demonstrate that the proposed model surpasses pre-training-based baselines, showcasing its ability to learn compact and semantically enriched representations while exhibiting superior scalability.

Title: Downscaling Precipitation with Bias-informed Conditional Diffusion Model

Authors: Ran Lyu (1), Linhan Wang (1), Yanshen Sun (1), Hedanqiu Bai (2), Chang-Tien Lu (1) ((1) Virginia Tech, (2) Texas A&M University)
Subjects: cs.LG, cs.CV, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.14539
Pdf URL: https://arxiv.org/pdf/2412.14539
Copy Paste: [[2412.14539]] Downscaling Precipitation with Bias-informed Conditional Diffusion Model(https://arxiv.org/abs/2412.14539)
Keywords: diffusion
Abstract: Climate change is intensifying rainfall extremes, making high-resolution precipitation projections crucial for society to better prepare for impacts such as flooding. However, current Global Climate Models (GCMs) operate at spatial resolutions too coarse for localized analyses. To address this limitation, deep learning-based statistical downscaling methods offer promising solutions, providing high-resolution precipitation projections with a moderate computational cost. In this work, we introduce a bias-informed conditional diffusion model for statistical downscaling of precipitation. Specifically, our model leverages a conditional diffusion approach to learn distribution priors from large-scale, high-resolution precipitation datasets. The long-tail distribution of precipitation poses a unique challenge for training diffusion models; to address this, we apply gamma correction during preprocessing. Additionally, to correct biases in the downscaled results, we employ a guided-sampling strategy to enhance bias correction. Our experiments demonstrate that the proposed model achieves highly accurate results in an 8 times downscaling setting, outperforming previous deterministic methods. The code and dataset are available at this https URL
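
Note (illustrative sketch, not the paper's code): the abstract mentions gamma correction as preprocessing for long-tailed precipitation values. A minimal version is a power-law transform with exponent below 1, which compresses large values and spreads out small ones. The exponent `gamma=0.5`, the normalization by a maximum value, and the function names are assumptions; the paper's exact transform and parameters are not given in the abstract.

```python
def gamma_correct(values, gamma=0.5, max_value=None):
    """Map non-negative values into [0, 1] with a power-law (gamma) transform."""
    if max_value is None:
        max_value = max(values)
    if max_value <= 0:
        # Nothing to scale: all inputs are zero.
        return [0.0 for _ in values]
    return [(v / max_value) ** gamma for v in values]


def gamma_invert(corrected, gamma, max_value):
    """Undo gamma_correct, returning values on the original scale."""
    return [max_value * (c ** (1.0 / gamma)) for c in corrected]


if __name__ == "__main__":
    rain_mm = [0.0, 1.0, 4.0, 100.0]  # toy, long-tailed sample
    scaled = gamma_correct(rain_mm, gamma=0.5)
    print(scaled)
    print(gamma_invert(scaled, gamma=0.5, max_value=100.0))
```

The inverse transform is applied to model outputs so that results are reported in the original precipitation units.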

Title: Global Spatio-Temporal Fusion-based Traffic Prediction Algorithm with Anomaly Aware

Authors: Chaoqun Liu, Xuanpeng Li, Chen Gong, Guangyu Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14569
Pdf URL: https://arxiv.org/pdf/2412.14569
Copy Paste: [[2412.14569]] Global Spatio-Temporal Fusion-based Traffic Prediction Algorithm with Anomaly Aware(https://arxiv.org/abs/2412.14569)
Keywords: anomaly
Abstract: Traffic prediction is an indispensable component of urban planning and traffic management. Achieving accurate traffic prediction hinges on the ability to capture the potential spatio-temporal relationships among road sensors. However, the majority of existing works focus on local short-term spatio-temporal correlations, failing to fully consider the interactions of different sensors in the long-term state. In addition, these works do not analyze the influences of anomalous factors, or have insufficient ability to extract personalized features of anomalous factors, which prevents them from effectively capturing the spatio-temporal influence of these factors on traffic prediction. To address the aforementioned issues, we propose a global spatio-temporal fusion-based traffic prediction algorithm that incorporates anomaly awareness. First, based on the designed anomaly detection network, we construct an efficient anomalous factors impacting module (AFIM) to evaluate the spatio-temporal impact of unexpected external events on traffic prediction. Furthermore, we propose a multi-scale spatio-temporal feature fusion module (MTSFFL) based on the transformer architecture to capture both long-term and short-term correlations among different sensors in a wide-area traffic environment for accurate prediction of traffic flow. Finally, experiments on real-scenario public transportation datasets (PEMS04 and PEMS08) demonstrate that our approach can achieve state-of-the-art performance.

Title: DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Authors: Yiren Song, Xiaokang Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14580
Pdf URL: https://arxiv.org/pdf/2412.14580
Copy Paste: [[2412.14580]] DiffSim: Taming Diffusion Models for Evaluating Visual Similarity(https://arxiv.org/abs/2412.14580)
Keywords: diffusion, self-supervised, generative
Abstract: Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.

Title: Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties

Authors: Wenqiao Li, Bozhong Zheng, Xiaohao Xu, Jinye Gan, Fading Lu, Xiang Li, Na Ni, Zheng Tian, Xiaonan Huang, Shenghua Gao, Yingna Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14592
Pdf URL: https://arxiv.org/pdf/2412.14592
Copy Paste: [[2412.14592]] Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties(https://arxiv.org/abs/2412.14592)
Keywords: anomaly
Abstract: Object anomaly detection is essential for industrial quality inspection, yet traditional single-sensor methods face critical limitations. They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. MulSen-AD unifies data from RGB cameras, laser scanners, and lock-in infrared thermography, effectively capturing external appearance, geometric deformations, and internal defects. The dataset spans 15 industrial products with diverse, real-world anomalies. We also present MulSen-AD Bench, a benchmark designed to evaluate multi-sensor methods, and propose MulSen-TripleAD, a decision-level fusion algorithm that integrates these three modalities for robust, unsupervised object anomaly detection. Our experiments demonstrate that multi-sensor fusion substantially outperforms single-sensor approaches, achieving 96.1% AUROC in object-level detection accuracy. These results highlight the importance of integrating multi-sensor data for comprehensive industrial anomaly detection.
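
Note (illustrative sketch, not the paper's code): decision-level fusion combines per-sensor anomaly scores after each modality has been scored independently. The snippet below normalizes each modality's scores and averages them per object. The min-max normalization, the plain average, and all names are assumptions made for illustration; how MulSen-TripleAD actually fuses its three modalities is not specified in the abstract.

```python
def min_max_normalize(scores):
    """Rescale a list of scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_decisions(rgb_scores, infrared_scores, point_cloud_scores):
    """Average normalized per-object anomaly scores from three modalities."""
    normalized = [
        min_max_normalize(modality)
        for modality in (rgb_scores, infrared_scores, point_cloud_scores)
    ]
    # zip(*normalized) groups the three scores belonging to each object.
    return [sum(per_object) / len(normalized) for per_object in zip(*normalized)]


if __name__ == "__main__":
    fused = fuse_decisions([0.1, 0.9], [10.0, 30.0], [2.0, 2.0])
    print(fused)
```

Normalizing before averaging keeps a modality with a large raw score range from dominating the fused decision.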

Title: LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining

Authors: Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14596
Pdf URL: https://arxiv.org/pdf/2412.14596
Copy Paste: [[2412.14596]] LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining(https://arxiv.org/abs/2412.14596)
Keywords: diffusion
Abstract: Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that the vision and layout modalities remain invariant across images in different languages. If language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm LDP (Language Decoupled Pre-training) for better utilization of monolingual pre-training data. Our proposed model LDM (Language Decoupled Model) is first pre-trained on the language-independent data, where the language knowledge is decoupled by a diffusion model, and then the LDM is fine-tuned on the downstream languages. Extensive experiments show that the LDM outperforms all SOTA multilingual pre-trained models and also remains competitive on downstream monolingual/English benchmarks.

Title: Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models

Authors: Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14628
Pdf URL: https://arxiv.org/pdf/2412.14628
Copy Paste: [[2412.14628]] Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models(https://arxiv.org/abs/2412.14628)
Keywords: diffusion, generative
Abstract: Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional U-Nets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua$^2$SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua$^2$SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-${\alpha}$, PixArt-${\Sigma}$, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.
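
Note (illustrative sketch, not the paper's code): fractional figures such as "3.4-bit" arise in mixed-precision quantization when different layers use different integer bit widths. One common way to report this is the parameter-weighted average bit width, computed below. Reading the abstract's numbers as this kind of average is an assumption, and the layer sizes and function name are made up for illustration.

```python
def average_bit_width(layers):
    """Parameter-weighted mean bit width of a mixed-precision configuration.

    layers: list of (num_weights, bits) tuples, one per quantized layer.
    """
    total_weights = sum(num_weights for num_weights, _ in layers)
    total_bits = sum(num_weights * bits for num_weights, bits in layers)
    return total_bits / total_weights


if __name__ == "__main__":
    # Toy model: 60% of the weights at 3 bits, 40% at 4 bits.
    config = [(600, 3), (400, 4)]
    print(average_bit_width(config))
```

Under this reading, lowering the average means assigning fewer bits to the layers that are least sensitive to quantization.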

Title: Robust PCA Based on Adaptive Weighted Least Squares and Low-Rank Matrix Factorization

Authors: Kexin Li, You-wei Wen, Xu Xiao, Mingchao Zhao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14629
Pdf URL: https://arxiv.org/pdf/2412.14629
Copy Paste: [[2412.14629]] Robust PCA Based on Adaptive Weighted Least Squares and Low-Rank Matrix Factorization(https://arxiv.org/abs/2412.14629)
Keywords: anomaly
Abstract: Robust Principal Component Analysis (RPCA) is a fundamental technique for decomposing data into low-rank and sparse components, which plays a critical role for applications such as image processing and anomaly detection. Traditional RPCA methods commonly use $\ell_1$ norm regularization to enforce sparsity, but this approach can introduce bias and result in suboptimal estimates, particularly in the presence of significant noise or outliers. Non-convex regularization methods have been proposed to mitigate these challenges, but they tend to be complex to optimize and sensitive to initial conditions, leading to potential instability in solutions. To overcome these challenges, in this paper, we propose a novel RPCA model that integrates adaptive weighted least squares (AWLS) and low-rank matrix factorization (LRMF). The model employs a self-attention-inspired mechanism in its weight update process, allowing the weight matrix to dynamically adjust and emphasize significant components during each iteration. By employing a weighted F-norm for the sparse component, our method effectively reduces bias while simplifying the computational process compared to traditional $\ell_1$-norm-based methods. We use an alternating minimization algorithm, where each subproblem has an explicit solution, thereby improving computational efficiency. Despite its simplicity, numerical experiments demonstrate that our method outperforms existing non-convex regularization approaches, offering superior performance and stability, as well as enhanced accuracy and robustness in practical applications.

Title: Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model

Authors: Minglong Xue, Jinhong He, Shivakumara Palaiahnakote, Mingliang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14630
Pdf URL: https://arxiv.org/pdf/2412.14630
Copy Paste: [[2412.14630]] Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model(https://arxiv.org/abs/2412.14630)
Keywords: diffusion
Abstract: Image restoration and enhancement are pivotal for numerous computer vision applications, yet unifying these tasks efficiently remains a significant challenge. Inspired by the iterative refinement capabilities of diffusion models, we propose CycleRDM, a novel framework designed to unify restoration and enhancement tasks while achieving high-quality mapping. Specifically, CycleRDM first learns the mapping relationships among the degraded domain, the rough normal domain, and the normal domain through a two-stage diffusion inference process. Subsequently, we transfer the final calibration process to the wavelet low-frequency domain using discrete wavelet transform, performing fine-grained calibration from a frequency domain perspective by leveraging task-specific frequency spaces. To improve restoration quality, we design a feature gain module for the decomposed wavelet high-frequency domain to eliminate redundant features. Additionally, we employ multimodal textual prompts and Fourier transform to drive stable denoising and reduce randomness during the inference process. After extensive validation, CycleRDM can be effectively generalized to a wide range of image restoration and enhancement tasks while requiring only a small number of training samples to be significantly superior on various benchmarks of reconstruction quality and perceptual quality. The source code will be available at this https URL.

Title: Event-assisted 12-stop HDR Imaging of Dynamic Scene

Authors: Shi Guo, Zixuan Chen, Ziran Zhang, Yutian Chen, Gangwei Xu, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14705
Pdf URL: https://arxiv.org/pdf/2412.14705
Copy Paste: [[2412.14705]] Event-assisted 12-stop HDR Imaging of Dynamic Scene(https://arxiv.org/abs/2412.14705)
Keywords: diffusion
Abstract: High dynamic range (HDR) imaging is a crucial task in computational photography, which captures details across diverse lighting conditions. Traditional HDR fusion methods face limitations in dynamic scenes with extreme exposure differences, as aligning low dynamic range (LDR) frames becomes challenging due to motion and brightness variation. In this work, we propose a novel 12-stop HDR imaging approach for dynamic scenes, leveraging a dual-camera system with an event camera and an RGB camera. The event camera provides temporally dense, high dynamic range signals that improve alignment between LDR frames with large exposure differences, reducing ghosting artifacts caused by motion. Also, a real-world finetuning strategy is proposed to increase the generalization of the alignment module on real-world events. Additionally, we introduce a diffusion-based fusion module that incorporates image priors from pre-trained diffusion models to address artifacts in high-contrast regions and minimize errors from the alignment process. To support this work, we developed the ESHDR dataset, the first dataset for 12-stop HDR imaging with synchronized event signals, and validated our approach on both simulated and real-world data. Extensive experiments demonstrate that our method achieves state-of-the-art performance, successfully extending HDR imaging to 12 stops in dynamic scenes.

Title: EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

Authors: Jianrong Zhang, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14706
Pdf URL: https://arxiv.org/pdf/2412.14706
Copy Paste: [[2412.14706]] EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space(https://arxiv.org/abs/2412.14706)
Keywords: diffusion
Abstract: Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.

Title: Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data

Authors: Fabian Sven Karst, Sook-Yee Chong, Abigail A. Antenor, Enyu Lin, Mahei Manhai Li, Jan Marco Leimeister
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14730
Pdf URL: https://arxiv.org/pdf/2412.14730
Copy Paste: [[2412.14730]] Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data(https://arxiv.org/abs/2412.14730)
Keywords: diffusion, generative
Abstract: The banking sector faces challenges in using deep learning due to data sensitivity and regulatory constraints, but generative AI may offer a solution. Thus, this study identifies effective algorithms for generating synthetic financial transaction data and evaluates five leading models - Conditional Tabular Generative Adversarial Networks (CTGAN), DoppelGANger (DGAN), Wasserstein GAN, Financial Diffusion (FinDiff), and Tabular Variational AutoEncoders (TVAE) - across five criteria: fidelity, synthesis quality, efficiency, privacy, and graph structure. While none of the algorithms is able to replicate the real data's graph structure, each excels in specific areas: DGAN is ideal for privacy-sensitive tasks, FinDiff and TVAE excel in data replication and augmentation, and CTGAN achieves a balance across all five criteria, making it suitable for general applications with moderate privacy concerns. As a result, our findings offer valuable insights for choosing the most suitable algorithm.

Title: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Authors: Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.14803
Pdf URL: https://arxiv.org/pdf/2412.14803
Copy Paste: [[2412.14803]] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations(https://arxiv.org/abs/2412.14803)
Keywords: diffusion
Abstract: Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies utilize pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, which were trained on two-image contrastive learning or single-image reconstruction, cannot perfectly capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human or robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1% relative improvement in the Calvin ABC-D benchmark compared to the previous state-of-the-art and delivers a 28.8% increase in success rates for complex real-world dexterous manipulation tasks.

Title: Explainable Tampered Text Detection via Multimodal Large Models

Authors: Chenfan Qu, Jian Liu, Haoxing Chen, Baihan Yu, Jingjing Liu, Weiqiang Wang, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14816
Pdf URL: https://arxiv.org/pdf/2412.14816
Copy Paste: [[2412.14816]] Explainable Tampered Text Detection via Multimodal Large Models(https://arxiv.org/abs/2412.14816)
Keywords: anomaly
Abstract: Recently, tampered text detection has attracted increasing attention due to its essential role in information security. Although existing methods can detect the tampered text region, the interpretation of such detection remains unclear, making the prediction unreliable. To address this black-box problem, we propose to explain the basis of tampered text detection with natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD, which contains both pixel-level annotations indicating the tampered text region and natural language annotations describing the anomaly of the tampered text. Multiple methods are employed to improve the quality of the proposed data. For example, a fused mask prompt is proposed to reduce confusion when querying GPT4o to generate anomaly descriptions. By weighting the input image with the mask annotation, the tampered region can be clearly indicated and the content in and around the tampered region can also be preserved. We also propose prompting GPT4o to recognize tampered texts and filtering out the responses with low OCR accuracy, which can effectively improve annotation quality in an automatic manner. To further improve explainable tampered text detection, we propose a simple yet effective model called TTD, which benefits from improved fine-grained perception by paying attention to the suspected region with an auxiliary reference grounding query. Extensive experiments on both the ETTD dataset and the public dataset have verified the effectiveness of the proposed methods. In-depth analysis is also provided to inspire further research. The dataset and code will be made publicly available.

Title: DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis

Authors: Hongling Xu, Yice Zhang, Qianlong Wang, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14849
Pdf URL: https://arxiv.org/pdf/2412.14849
Copy Paste: [[2412.14849]] DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2412.14849)
Keywords: in-context
Abstract: Recently developed large language models (LLMs) have presented promising new avenues to address data scarcity in low-resource scenarios. In few-shot aspect-based sentiment analysis (ABSA), previous efforts have explored data augmentation techniques, which prompt LLMs to generate new samples by modifying existing ones. However, these methods fail to produce adequately diverse data, impairing their effectiveness. Besides, some studies apply in-context learning for ABSA by using specific instructions and a few selected examples as prompts. Though promising, LLMs often yield labels that deviate from task requirements. To overcome these limitations, we propose DS$^2$-ABSA, a dual-stream data synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize data from two complementary perspectives: key-point-driven and instance-driven, which effectively generate diverse and high-quality ABSA samples in low-resource settings. Furthermore, a label refinement module is integrated to improve the synthetic labels. Extensive experiments demonstrate that DS$^2$-ABSA significantly outperforms previous few-shot ABSA solutions and other LLM-oriented data generation methods.

Title: Zero-Shot Artifact2Artifact: Self-incentive artifact removal for photoacoustic imaging without any data

Authors: Shuang Li, Qian Chen, Chulhong Kim, Seongwook Choi, Yibing Wang, Yu Zhang, Changhui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14873
Pdf URL: https://arxiv.org/pdf/2412.14873
Copy Paste: [[2412.14873]] Zero-Shot Artifact2Artifact: Self-incentive artifact removal for photoacoustic imaging without any data(https://arxiv.org/abs/2412.14873)
Keywords: self-supervised
Abstract: Photoacoustic imaging (PAI) uniquely combines optical contrast with the penetration depth of ultrasound, making it critical for clinical applications. However, the quality of 3D PAI is often degraded due to reconstruction artifacts caused by the sparse and angle-limited configuration of detector arrays. Existing iterative or deep learning-based methods are either time-consuming or require large training datasets, significantly limiting their practical application. Here, we propose Zero-Shot Artifact2Artifact (ZS-A2A), a zero-shot self-supervised artifact removal method based on a super-lightweight network, which leverages the fact that reconstruction artifacts are sensitive to irregularities caused by data loss. By introducing random perturbations to the acquired PA data, it spontaneously generates subset data, which in turn stimulates the network to learn the artifact patterns in the reconstruction results, thus enabling zero-shot artifact removal. This approach requires neither training data nor prior knowledge of the artifacts, and is capable of artifact removal for 3D PAI. For maximum amplitude projection (MAP) images or slice images in 3D PAI acquired with arbitrarily sparse or angle-limited detector arrays, ZS-A2A employs a self-incentive strategy to complete artifact removal and improves the Contrast-to-Noise Ratio (CNR). We validated ZS-A2A in both a simulation study and in vivo animal experiments. Results demonstrate that ZS-A2A achieves state-of-the-art (SOTA) performance compared to existing zero-shot methods, and for the in vivo rat liver, ZS-A2A improves CNR from 17.48 to 43.46 in just 8 seconds. The project for ZS-A2A will be available in the following GitHub repository: this https URL.

Title: Diffusion priors for Bayesian 3D reconstruction from incomplete measurements

Authors: Julian L. Möbius, Michael Habeck
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14897
Pdf URL: https://arxiv.org/pdf/2412.14897
Copy Paste: [[2412.14897]] Diffusion priors for Bayesian 3D reconstruction from incomplete measurements(https://arxiv.org/abs/2412.14897)
Keywords: diffusion
Abstract: Many inverse problems are ill-posed and need to be complemented by prior information that restricts the class of admissible models. Bayesian approaches encode this information as prior distributions that impose generic properties on the model such as sparsity, non-negativity or smoothness. However, in the case of complex structured models such as images, graphs or three-dimensional (3D) objects, generic prior distributions tend to favor models that differ largely from those observed in the real world. Here we explore the use of diffusion models as priors that are combined with experimental data within a Bayesian framework. We use 3D point clouds to represent 3D objects such as household items or biomolecular complexes formed from proteins and nucleic acids. We train diffusion models that generate coarse-grained 3D structures at a medium resolution and integrate these with incomplete and noisy experimental data. To demonstrate the power of our approach, we focus on the reconstruction of biomolecular assemblies from cryo-electron microscopy (cryo-EM) images, which is an important inverse problem in structural biology. We find that posterior sampling with diffusion model priors allows for 3D reconstruction from very sparse, low-resolution and partial observations.

Title: MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models

Authors: Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Hunag, Yuhua Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14902
Pdf URL: https://arxiv.org/pdf/2412.14902
Copy Paste: [[2412.14902]] MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models(https://arxiv.org/abs/2412.14902)
Keywords: diffusion
Abstract: Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. Is it possible to make such models generate generic identities as simply as the famous ones, e.g., by just using a name? In this paper, we explore the existence of a "Name Space", where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by text embeddings of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in giving the generated image good identity consistency. Note that like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, making the original generation capability of text-to-image models well-preserved. Moreover, by simply plugging in such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models. Project homepage: this https URL.

Title: Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation

Authors: Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Jian-Guang Lou, Bing Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14905
Pdf URL: https://arxiv.org/pdf/2412.14905
Copy Paste: [[2412.14905]] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation(https://arxiv.org/abs/2412.14905)
Keywords: in-context
Abstract: Large language models (LLMs) are susceptible to generating hallucinated information, despite the integration of retrieval-augmented generation (RAG). Parallel context extension (PCE) is a line of research attempting to effectively integrate parallel (unordered) contexts, while it still suffers from hallucinations when adapted to RAG scenarios. In this paper, we propose DePaC (Dehallucinating Parallel Context Extension), which alleviates the hallucination problem with context-aware negative training and information-calibrated aggregation. DePaC is designed to alleviate two types of in-context hallucination: fact fabrication (i.e., LLMs present claims that are not supported by the contexts) and fact omission (i.e., LLMs fail to present claims that can be supported by the contexts). Specifically, (1) for fact fabrication, we apply the context-aware negative training that fine-tunes the LLMs with negative supervisions, thus explicitly guiding the LLMs to refuse to answer when contexts are not related to questions; (2) for fact omission, we propose the information-calibrated aggregation which prioritizes context windows with higher information increment from their contexts. The experimental results on nine RAG tasks demonstrate that DePaC significantly alleviates the two types of hallucination and consistently achieves better performance on these tasks.

Title: DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space

Authors: Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.15032
Pdf URL: https://arxiv.org/pdf/2412.15032
Copy Paste: [[2412.15032]] DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space(https://arxiv.org/abs/2412.15032)
Keywords: diffusion, generative
Abstract: This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to high-resolution generation without using the latent diffusion paradigm. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why 'image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at this https URL.
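
For readers unfamiliar with the DCT space the abstract refers to, the following is a minimal sketch of an orthonormal 2D DCT-II applied to a small image block, written in plain Python. It only illustrates the underlying transform; the function names (dct2_1d, idct2_1d, dct2_2d) are hypothetical, and nothing here reflects how DCTdiff itself tokenizes, scales, or models the coefficients.

```python
import math


def dct2_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        # Scaling that makes the transform matrix orthogonal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2_1d(c):
    """Inverse of dct2_1d (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def dct2_2d(block):
    """Separable 2D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(list(r)) for r in block]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [column][frequency]; transpose back to [frequency][column].
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    flat_block = [[1.0, 1.0], [1.0, 1.0]]
    # A constant block has all of its energy in the top-left (DC) coefficient.
    print(dct2_2d(flat_block))
```

A smooth image block concentrates most of its energy in the low-frequency (top-left) coefficients, which is the general property that makes frequency-space representations compact.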

Title: Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion

Authors: Zhifei Chen, Tianshuo Xu, Wenhang Ge, Leyi Wu, Dongyu Yan, Jing He, Luozhou Wang, Lu Zeng, Shunsi Zhang, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15050
Pdf URL: https://arxiv.org/pdf/2412.15050
Copy Paste: [[2412.15050]] Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion(https://arxiv.org/abs/2412.15050)
Keywords: diffusion
Abstract: Rendering and inverse rendering are pivotal tasks in both computer vision and graphics. The rendering equation is the core of the two tasks, as an ideal conditional distribution transfer function from intrinsic properties to RGB images. Although existing rendering methods achieve promising results, they merely approximate the ideal estimation for a specific scene and come with a high computational cost. Additionally, the inverse conditional distribution transfer is intractable due to the inherent ambiguity. To address these challenges, we propose a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Inspired by UniDiffuser, we utilize two distinct time schedules to model both tasks, and with a tailored dual streaming module, we achieve cross-conditioning of two pre-trained diffusion models. This unified approach, named Uni-Renderer, allows the two processes to facilitate each other through a cycle-consistent constraint, mitigating ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a meticulously prepared dataset, our method effectively decomposes intrinsic properties and demonstrates a strong capability to recognize changes during rendering. We will open-source our training and inference code to the public, fostering further research and development in this area.

Title: MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance

Authors: Hallee E. Wong, Jose Javier Gonzalez Ortiz, John Guttag, Adrian V. Dalca
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.15058
Pdf URL: https://arxiv.org/pdf/2412.15058
Copy Paste: [[2412.15058]] MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance(https://arxiv.org/abs/2412.15058)
Keywords: in-context
Abstract: Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of manually labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, using MultiverSeg reduced the total number of scribble steps by 53% and clicks by 36% to achieve 90% Dice on sets of images from unseen tasks. We release code and model weights at this https URL
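
The 90% Dice target quoted above refers to the Dice coefficient, a standard overlap measure between a predicted and a reference segmentation mask. The following is a minimal sketch in plain Python, assuming binary masks flattened to sequences of 0/1; the function name dice_score and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from MultiverSeg.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(predicted, reference))
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels.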

Title: Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation

Authors: Haoran Liu, Youzhi Luo, Tianxiao Li, James Caverlee, Martin Renqiang Min
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15086
Pdf URL: https://arxiv.org/pdf/2412.15086
Copy Paste: [[2412.15086]] Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation(https://arxiv.org/abs/2412.15086)
Keywords: generative
Abstract: We consider the conditional generation of 3D drug-like molecules with \textit{explicit control} over molecular properties such as drug-like properties (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effectively binding to specific protein sites. To tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment-based coordinate loss to adapt equivariant networks for auto-regressive de-novo 3D molecule generation from scratch. Extensive experiments validate our model's effectiveness on property-guided and context-guided molecule generation, both for de-novo 3D molecule design and structure-based drug discovery against protein targets.

Title: Jet: A Modern Transformer-Based Normalizing Flow

Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15129
Pdf URL: https://arxiv.org/pdf/2412.15129
Copy Paste: [[2412.15129]] Jet: A Modern Transformer-Based Normalizing Flow(https://arxiv.org/abs/2412.15129)
Keywords: diffusion, generative
Abstract: In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute log-likelihood of the input data, fast generation and simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of the samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches or diffusion models. In this paper we revisit the design of the coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture, not convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.

Title: Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Authors: Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.15156
Pdf URL: https://arxiv.org/pdf/2412.15156
Copy Paste: [[2412.15156]] Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM(https://arxiv.org/abs/2412.15156)
Keywords: diffusion
Abstract: Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
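
The direct preference optimization (DPO) step mentioned above is typically trained with a logistic loss on the difference of policy-versus-reference log-probability ratios for a preferred and a rejected sample. The following is a minimal sketch of that standard per-pair loss in plain Python; the function name dpo_loss, its argument names, and the default beta of 0.1 are illustrative assumptions, and the abstract does not specify how Prompt-A-Video configures this step.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss decreases as the policy raises the likelihood of the preferred sample relative to the rejected one, measured against the reference model.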

Title: OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Authors: Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15159
Pdf URL: https://arxiv.org/pdf/2412.15159
Copy Paste: [[2412.15159]] OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization(https://arxiv.org/abs/2412.15159)
Keywords: diffusion
Abstract: In recent years, the field of text-to-video (T2V) generation has made significant strides. Despite this progress, there is still a gap between theoretical advancements and practical application, amplified by issues like degraded image quality and flickering artifacts. Recent advancements in enhancing the video diffusion model (VDM) through feedback learning have shown promising results. However, these methods still exhibit notable limitations, such as misaligned feedback and inferior scalability. To tackle these issues, we introduce OnlineVPO, a more efficient preference learning approach tailored specifically for video diffusion models. Our method features two novel designs. First, instead of directly using image-based reward feedback, we leverage a video quality assessment (VQA) model trained on synthetic data as the reward model to provide distribution- and modality-aligned feedback on the video diffusion model. Second, we introduce an online DPO algorithm to address the off-policy optimization and scalability issues in existing video preference learning frameworks. By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance. Extensive experiments on the open-source video-diffusion model demonstrate that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models, offering valuable insights for future advancements in this domain.

Title: Tiled Diffusion

Authors: Or Madar, Ohad Fried
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15185
Pdf URL: https://arxiv.org/pdf/2412.15185
Copy Paste: [[2412.15185]] Tiled Diffusion(https://arxiv.org/abs/2412.15185)
Keywords: diffusion, generative
Abstract: Image tiling -- the seamless connection of disparate images to create a coherent visual field -- is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, a method that poses significant limitations in scalability and flexibility. Recent research has attempted to automate this process using generative models. However, current approaches primarily focus on tiling textures and manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to accommodate the generation of cohesive tiling patterns across various domains of image synthesis that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, eliminating the need for manual intervention and enhancing creative possibilities in various applications, such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.

Title: LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15188
Pdf URL: https://arxiv.org/pdf/2412.15188
Copy Paste: [[2412.15188]] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation(https://arxiv.org/abs/2412.15188)
Keywords: diffusion, generative
Abstract: We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages existing Llama-3's weights for processing texts autoregressively while introducing additional and parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and only training the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.

Title: AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.15191
Pdf URL: https://arxiv.org/pdf/2412.15191
Copy Paste: [[2412.15191]] AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation(https://arxiv.org/abs/2412.15191)
Keywords: diffusion
Abstract: We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: this http URL

Title: DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.15200
Pdf URL: https://arxiv.org/pdf/2412.15200
Copy Paste: [[2412.15200]] DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation(https://arxiv.org/abs/2412.15200)
Keywords: diffusion
Abstract: Procedural Content Generation (PCG) is powerful in creating high-quality 3D content, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sample iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately and generalizing well to in-the-wild images. Quantitative and qualitative experimental results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.

Title: LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Authors: Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15204
Pdf URL: https://arxiv.org/pdf/2412.15204
Copy Paste: [[2412.15204]] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks(https://arxiv.org/abs/2412.15204)
Keywords: in-context
Abstract: This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answering the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at this https URL.

Title: Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation

Authors: Hadi Alzayer, Philipp Henzler, Jonathan T. Barron, Jia-Bin Huang, Pratul P. Srinivasan, Dor Verbin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15211
Pdf URL: https://arxiv.org/pdf/2412.15211
Copy Paste: [[2412.15211]] Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation(https://arxiv.org/abs/2412.15211)
Keywords: diffusion, generative
Abstract: Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult as the illumination and therefore the object appearance vary across captured images. This is particularly challenging for more specular objects whose appearance strongly depends on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically-based rendering to recover the materials and per-image illumination. Such approaches fail at faithfully recovering view-dependent appearance given significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both synthetic and real datasets and demonstrate that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent "shiny" appearance which cannot be reconstructed by prior methods.

Title: Scaling 4D Representations

Authors: João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15212
Pdf URL: https://arxiv.org/pdf/2412.15212
Copy Paste: [[2412.15212]] Scaling 4D Representations(https://arxiv.org/abs/2412.15212)
Keywords: self-supervised
Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.

Title: Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15213
Pdf URL: https://arxiv.org/pdf/2412.15213
Copy Paste: [[2412.15213]] Flowing from Words to Pixels: A Framework for Cross-Modality Evolution(https://arxiv.org/abs/2412.15213)
Keywords: diffusion
Abstract: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
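
Illustrative sketch (not the authors' code): the idea of flowing directly from one modality to another can be shown with the basic flow matching quantities, where the source sample x0 is a data latent (for example an encoded caption) instead of Gaussian noise, and x1 is the paired target latent. The straight-line interpolation path, the function names, and the Euler sampler below are assumptions chosen for brevity; the abstract does not specify these details.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from source x0 to target x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t for linear interpolation."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate the learned velocity field from the source latent towards the target."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

With a perfect velocity predictor the loss is zero and euler_sample carries the source latent onto its paired target, without a separate noise input or conditioning branch.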

Title: LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15214
Pdf URL: https://arxiv.org/pdf/2412.15214
Copy Paste: [[2412.15214]] LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis(https://arxiv.org/abs/2412.15214)
Keywords: diffusion
Abstract: The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but also facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images. Project page: this https URL
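
Illustrative sketch (not the authors' code): one simple way to abstract an object mask into a few cluster points is to run k-means on the coordinates of the mask's foreground pixels. The choice of k-means, the function name mask_to_cluster_points, and the deterministic initialization are assumptions made here for illustration; the abstract does not name the clustering procedure.

```python
def mask_to_cluster_points(mask, k, iters=20):
    """Summarize a binary mask (2D list of 0/1) as k (row, col) centroids."""
    pixels = [
        (r, c) for r, row in enumerate(mask) for c, val in enumerate(row) if val
    ]
    if len(pixels) < k:
        raise ValueError("mask has fewer foreground pixels than clusters")
    # Deterministic initialization: evenly spaced pixels in scan order.
    step = len(pixels) / k
    centers = [pixels[int(i * step)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: attach each pixel to its nearest center.
        groups = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its pixels.
        new_centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g
            else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Each returned point could then be paired with a user-assigned relative depth and an instance label to form the kind of control signal the abstract describes.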