2024-03-12

Title: Audio-Synchronized Visual Animation

Authors: Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.05659
Pdf URL: https://arxiv.org/pdf/2403.05659
Copy Paste: [[2403.05659]] Audio-Synchronized Visual Animation(https://arxiv.org/abs/2403.05659)
Keywords: diffusion
Abstract: Current visual generation methods can produce high quality videos guided by texts. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. We introduce Audio Synchronized Visual Animation (ASVA), a task animating a static image to demonstrate motion dynamics, temporally guided by audio clips across multiple classes. To this end, we present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio visual events across 15 categories. We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audios. Extensive evaluations validate AVSync15 as a reliable benchmark for synchronized generation and demonstrate our models superior performance. We further explore AVSyncDs potential in a variety of audio synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. We hope our established benchmark can open new avenues for controllable visual generation. More videos on project webpage https://lzhangbj.github.io/projects/asva/asva.html.

Title: DP-TabICL: In-Context Learning with Differentially Private Tabular Data

Authors: Alycia N. Carey, Karuna Bhaila, Kennedy Edemacu, Xintao Wu
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.05681
Pdf URL: https://arxiv.org/pdf/2403.05681
Copy Paste: [[2403.05681]] DP-TabICL: In-Context Learning with Differentially Private Tabular Data(https://arxiv.org/abs/2403.05681)
Keywords: in-context
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs and it has been shown to have comparable performance to costly model retraining and fine-tuning. Recently, ICL has been extended to allow tabular data to be used as demonstration examples by serializing individual records into natural language formats. However, it has been shown that LLMs can leak information contained in prompts, and since tabular data often contain sensitive information, understanding how to protect the underlying tabular data used in ICL is a critical area of research. This work serves as an initial investigation into how to use differential privacy (DP) -- the long-established gold standard for data privacy and anonymization -- to protect tabular data used in ICL. Specifically, we investigate the application of DP mechanisms for private tabular ICL via data privatization prior to serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP scenarios via injecting noise into individual records or group statistics, respectively. We evaluate our DP-based frameworks on eight real-world tabular datasets and across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving comparable performance to non-LLM baselines, especially under high privacy regimes.

Title: SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes

Authors: Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, Sunipa Dev
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.05696
Pdf URL: https://arxiv.org/pdf/2403.05696
Copy Paste: [[2403.05696]] SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes(https://arxiv.org/abs/2403.05696)
Keywords: generative
Abstract: While generative multilingual models are rapidly being deployed, their safety and fairness evaluations are largely limited to resources collected in English. This is especially problematic for evaluations targeting inherently socio-cultural phenomena such as stereotyping, where it is important to build multi-lingual resources that reflect the stereotypes prevalent in respective language communities. However, gathering these resources, at scale, in varied languages and regions pose a significant challenge as it requires broad socio-cultural knowledge and can also be prohibitively expensive. To overcome this critical gap, we employ a recently introduced approach that couples LLM generations for scale with culturally situated validations for reliability, and build SeeGULL Multilingual, a global-scale multilingual dataset of social stereotypes, containing over 25K stereotypes, spanning 20 languages, with human annotations across 23 regions, and demonstrate its utility in identifying gaps in model evaluations. Content warning: Stereotypes shared in this paper can be offensive.

Title: A Benchmark of Domain-Adapted Large Language Models for Generating Brief Hospital Course Summaries

Authors: Asad Aali, Dave Van Veen, Yamin Ishraq Arefeen, Jason Hom, Christian Bluethgen, Eduardo Pontes Reis, Sergios Gatidis, Namuun Clifford, Joseph Daws, Arash S. Tehrani, Jangwon Kim, Akshay S. Chaudhari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.05720
Pdf URL: https://arxiv.org/pdf/2403.05720
Copy Paste: [[2403.05720]] A Benchmark of Domain-Adapted Large Language Models for Generating Brief Hospital Course Summaries(https://arxiv.org/abs/2403.05720)
Keywords: in-context
Abstract: Brief hospital course (BHC) summaries are common clinical documents generated by summarizing clinical notes. While large language models (LLMs) depict remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as BHC synthesis have not been shown. To enable the adaptation of LLMs for BHC synthesis, we introduce a novel benchmark consisting of a pre-processed dataset extracted from MIMIC-IV notes, encapsulating clinical note, and brief hospital course (BHC) pairs. We assess the performance of two general-purpose LLMs and three healthcare-adapted LLMs to improve BHC synthesis from clinical notes. Using clinical notes as input for generating BHCs, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs (Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5, GPT-4). We quantitatively evaluate the performance of these LLMs across varying context-length inputs using conventional natural language similarity metrics. We further perform a qualitative study where five diverse clinicians blindly compare clinician-written BHCs and two LLM-generated BHCs for 30 samples across metrics of comprehensiveness, conciseness, factual correctness, and fluency. Overall, we present a new benchmark and pre-processed dataset for using LLMs in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. We propose our work as a benchmark to motivate future works to adapt and assess the performance of LLMs in BHC synthesis.

Title: Inception Attacks: Immersive Hijacking in Virtual Reality Systems

Authors: Zhuolin Yang, Cathy Yuanchen Li, Arman Bhalla, Ben Y. Zhao, Haitao Zheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.05721
Pdf URL: https://arxiv.org/pdf/2403.05721
Copy Paste: [[2403.05721]] Inception Attacks: Immersive Hijacking in Virtual Reality Systems(https://arxiv.org/abs/2403.05721)
Keywords: generative
Abstract: Recent advances in virtual reality (VR) system provide fully immersive interactions that connect users with online resources, applications, and each other. Yet these immersive interfaces can make it easier for users to fall prey to a new type of security attacks. We introduce the inception attack, where an attacker controls and manipulates a user's interaction with their VR environment and applications, by trapping them inside a malicious VR application that masquerades as the full VR system. Once trapped in an "inception VR layer", all of the user's interactions with remote servers, network applications, and other VR users can be recorded or modified without their knowledge. This enables traditional attacks (recording passwords and modifying user actions in flight), as well as VR interaction attacks, where (with generative AI tools) two VR users interacting can experience two dramatically different conversations. In this paper, we introduce inception attacks and their design, and describe our implementation that works on all Meta Quest VR headsets. Our implementation of inception attacks includes a cloned version of the Meta Quest browser that can modify data as it's displayed to the user, and alter user input en route to the server (e.g. modify amount of $ transferred in a banking session). Our implementation also includes a cloned VRChat app, where an attacker can eavesdrop and modify live audio between two VR users. We then conduct a study on users with a range of VR experiences, execute the inception attack during their session, and debrief them about their experiences. Only 37% of users noticed the momentary visual "glitch" when the inception attack began, and all but 1 user attributed it to imperfections in the VR platform. Finally, we consider and discuss efficacy and tradeoffs for a wide range of potential inception defenses.

Title: Augmentations vs Algorithms: What Works in Self-Supervised Learning

Authors: Warren Morningstar, Alex Bijamov, Chris Duvarney, Luke Friedman, Neha Kalibhat, Luyang Liu, Philip Mansfield, Renan Rojas-Gomez, Karan Singhal, Bradley Green, Sushant Prakash
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.05726
Pdf URL: https://arxiv.org/pdf/2403.05726
Copy Paste: [[2403.05726]] Augmentations vs Algorithms: What Works in Self-Supervised Learning(https://arxiv.org/abs/2403.05726)
Keywords: self-supervised
Abstract: We study the relative effects of data augmentations, pretraining algorithms, and model architectures in Self-Supervised Learning (SSL). While the recent literature in this space leaves the impression that the pretraining algorithm is of critical importance to performance, understanding its effect is complicated by the difficulty in making objective and direct comparisons between methods. We propose a new framework which unifies many seemingly disparate SSL methods into a single shared template. Using this framework, we identify aspects in which methods differ and observe that in addition to changing the pretraining algorithm, many works also use new data augmentations or more powerful model architectures. We compare several popular SSL methods using our framework and find that many algorithmic additions, such as prediction networks or new losses, have a minor impact on downstream task performance (often less than $1\%$), while enhanced augmentation techniques offer more significant performance improvements ($2-4\%$). Our findings challenge the premise that SSL is being driven primarily by algorithmic improvements, and suggest instead a bitter lesson for SSL: that augmentation diversity and data / model scale are more critical contributors to recent advances in self-supervised learning.

Title: MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Authors: Xinyao Fan, Yueying Wu, Chang Xu, Yuhao Huang, Weiqing Liu, Jiang Bian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.05751
Pdf URL: https://arxiv.org/pdf/2403.05751
Copy Paste: [[2403.05751]] MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process(https://arxiv.org/abs/2403.05751)
Keywords: diffusion, generative
Abstract: Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.

Title: uniGradICON: A Foundation Model for Medical Image Registration

Authors: Lin Tian, Hastings Greer, Roland Kwitt, Francois-Xavier Vialard, Raul San Jose Estepar, Sylvain Bouix, Richard Rushmore, Marc Niethammer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.05780
Pdf URL: https://arxiv.org/pdf/2403.05780
Copy Paste: [[2403.05780]] uniGradICON: A Foundation Model for Medical Image Registration(https://arxiv.org/abs/2403.05780)
Keywords: foundation model
Abstract: Conventional medical image registration approaches directly optimize over the parameters of a transformation model. These approaches have been highly successful and are used generically for registrations of different anatomical regions. Recent deep registration networks are incredibly fast and accurate but are only trained for specific tasks. Hence, they are no longer generic registration approaches. We therefore propose uniGradICON, a first step toward a foundation model for registration providing 1) great performance \emph{across} multiple datasets which is not feasible for current learning-based registration methods, 2) zero-shot capabilities for new registration tasks suitable for different acquisitions, anatomical regions, and modalities compared to the training dataset, and 3) a strong initialization for finetuning on out-of-distribution registration tasks. UniGradICON unifies the speed and accuracy benefits of learning-based registration algorithms with the generic applicability of conventional non-deep-learning approaches. We extensively trained and evaluated uniGradICON on twelve different public datasets. Our code and the uniGradICON model are available at https://github.com/uncbiag/uniGradICON.

Title: Privacy-Preserving Diffusion Model Using Homomorphic Encryption

Authors: Yaojian Chen, Qiben Yan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2403.05794
Pdf URL: https://arxiv.org/pdf/2403.05794
Copy Paste: [[2403.05794]] Privacy-Preserving Diffusion Model Using Homomorphic Encryption(https://arxiv.org/abs/2403.05794)
Keywords: diffusion, generative
Abstract: In this paper, we introduce a privacy-preserving stable diffusion framework leveraging homomorphic encryption, called HE-Diffusion, which primarily focuses on protecting the denoising phase of the diffusion process. HE-Diffusion is a tailored encryption framework specifically designed to align with the unique architecture of stable diffusion, ensuring both privacy and functionality. To address the inherent computational challenges, we propose a novel min-distortion method that enables efficient partial image encryption, significantly reducing the overhead without compromising the model's output quality. Furthermore, we adopt a sparse tensor representation to expedite computational operations, enhancing the overall efficiency of the privacy-preserving diffusion process. We successfully implement HE-based privacy-preserving stable diffusion inference. The experimental results show that HE-Diffusion achieves 500 times speedup compared with the baseline method, and reduces time cost of the homomorphically encrypted inference to the minute level. Both the performance and accuracy of the HE-Diffusion are on par with the plaintext counterpart. Our approach marks a significant step towards integrating advanced cryptographic techniques with state-of-the-art generative models, paving the way for privacy-preserving and efficient image generation in critical applications.

Title: ClinicalMamba: A Generative Clinical Language Model on Longitudinal Clinical Notes

Authors: Zhichao Yang, Avijit Mitra, Sunjae Kwon, Hong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.05795
Pdf URL: https://arxiv.org/pdf/2403.05795
Copy Paste: [[2403.05795]] ClinicalMamba: A Generative Clinical Language Model on Longitudinal Clinical Notes(https://arxiv.org/abs/2403.05795)
Keywords: generative
Abstract: The advancement of natural language processing (NLP) systems in healthcare hinges on language model ability to interpret the intricate information contained within clinical notes. This process often requires integrating information from various time points in a patient's medical history. However, most earlier clinical language models were pretrained with a context length limited to roughly one clinical document. In this study, We introduce ClinicalMamba, a specialized version of the Mamba language model, pretrained on a vast corpus of longitudinal clinical notes to address the unique linguistic characteristics and information processing needs of the medical domain. ClinicalMamba, with 130 million and 2.8 billion parameters, demonstrates a superior performance in modeling clinical language across extended text lengths compared to Mamba and clinical Llama. With few-shot learning, ClinicalMamba achieves notable benchmarks in speed and accuracy, outperforming existing clinical language models and general domain large models like GPT-4 in longitudinal clinical notes information extraction tasks.

Title: A self-supervised CNN for image watermark removal

Authors: Chunwei Tian, Menghua Zheng, Tiancai Jiao, Wangmeng Zuo, Yanning Zhang, Chia-Wen Lin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.05807
Pdf URL: https://arxiv.org/pdf/2403.05807
Copy Paste: [[2403.05807]] A self-supervised CNN for image watermark removal(https://arxiv.org/abs/2403.05807)
Keywords: self-supervised
Abstract: Popular convolutional neural networks mainly use paired images in a supervised way for image watermark removal. However, watermarked images do not have reference images in the real world, which results in poor robustness of image watermark removal techniques. In this paper, we propose a self-supervised convolutional neural network (CNN) in image watermark removal (SWCNN). SWCNN uses a self-supervised way to construct reference watermarked images rather than given paired training samples, according to watermark distribution. A heterogeneous U-Net architecture is used to extract more complementary structural information via simple components for image watermark removal. Taking into account texture information, a mixed loss is exploited to improve visual effects of image watermark removal. Besides, a watermark dataset is conducted. Experimental results show that the proposed SWCNN is superior to popular CNNs in image watermark removal.

Title: Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution

Authors: Junxiong Lin, Yan Wang, Zeng Tao, Boyang Wang, Qing Zhao, Haorang Wang, Xuan Tong, Xinji Mai, Yuxuan Lin, Wei Song, Jiawen Yu, Shaoqi Yan, Wenqiang Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.05808
Pdf URL: https://arxiv.org/pdf/2403.05808
Copy Paste: [[2403.05808]] Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution(https://arxiv.org/abs/2403.05808)
Keywords: diffusion
Abstract: Pre-trained diffusion models utilized for image generation encapsulate a substantial reservoir of a priori knowledge pertaining to intricate textures. Harnessing the potential of leveraging this a priori knowledge in the context of image super-resolution presents a compelling avenue. Nonetheless, prevailing diffusion-based methodologies presently overlook the constraints imposed by degradation information on the diffusion process. Furthermore, these methods fail to consider the spatial variability inherent in the estimated blur kernel, stemming from factors such as motion jitter and out-of-focus elements in open-environment scenarios. This oversight results in a notable deviation of the image super-resolution effect from fundamental realities. To address these concerns, we introduce a framework known as Adaptive Multi-modal Fusion of \textbf{S}patially Variant Kernel Refinement with Diffusion Model for Blind Image \textbf{S}uper-\textbf{R}esolution (SSR). Within the SSR framework, we propose a Spatially Variant Kernel Refinement (SVKR) module. SVKR estimates a Depth-Informed Kernel, which takes the depth information into account and is spatially variant. Additionally, SVKR enhance the accuracy of depth information acquired from LR images, allowing for mutual enhancement between the depth map and blur kernel estimates. Finally, we introduce the Adaptive Multi-Modal Fusion (AMF) module to align the information from three modalities: low-resolution images, depth maps, and blur kernels. This alignment can constrain the diffusion model to generate more authentic SR results. Quantitative and qualitative experiments affirm the superiority of our approach, while ablation experiments corroborate the effectiveness of the modules we have proposed.

Title: SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection

Authors: Gang Zhang, Junnan Chen, Guohuan Gao, Jianmin Li, Si Liu, Xiaolin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.05817
Pdf URL: https://arxiv.org/pdf/2403.05817
Copy Paste: [[2403.05817]] SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection(https://arxiv.org/abs/2403.05817)
Keywords: diffusion
Abstract: LiDAR-based 3D object detection plays an essential role in autonomous driving. Existing high-performing 3D object detectors usually build dense feature maps in the backbone network and prediction head. However, the computational costs introduced by the dense feature maps grow quadratically as the perception range increases, making these models hard to scale up to long-range detection. Some recent works have attempted to construct fully sparse detectors to solve this issue; nevertheless, the resulting models either rely on a complex multi-stage pipeline or exhibit inferior performance. In this work, we propose SAFDNet, a straightforward yet highly effective architecture, tailored for fully sparse 3D object detection. In SAFDNet, an adaptive feature diffusion strategy is designed to address the center feature missing problem. We conducted extensive experiments on Waymo Open, nuScenes, and Argoverse2 datasets. SAFDNet performed slightly better than the previous SOTA on the first two datasets but much better on the last dataset, which features long-range detection, verifying the efficacy of SAFDNet in scenarios where long-range detection is required. Notably, on Argoverse2, SAFDNet surpassed the previous best hybrid detector HEDNet by 2.6% mAP while being 2.1x faster, and yielded 2.1% mAP gains over the previous best sparse detector FSDv2 while being 1.3x faster. The code will be available at https://github.com/zhanggang001/HEDNet.

Title: TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation

Authors: Jian Qu, Xiaobo Ma, Jianfeng Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.05822
Pdf URL: https://arxiv.org/pdf/2403.05822
Copy Paste: [[2403.05822]] TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation(https://arxiv.org/abs/2403.05822)
Keywords: generative
Abstract: Over the years, network traffic analysis and generation have advanced significantly. From traditional statistical methods, the field has progressed to sophisticated deep learning techniques. This progress has improved the ability to detect complex patterns and security threats, as well as to test and optimize network performance. However, obstacles persist, such as the dependence on labeled data for analysis and the difficulty of generating traffic samples that follow realistic patterns. Pre-trained deep neural networks have emerged as powerful tools to resolve these issues, offering improved performance by learning robust data representations from large unlabeled datasets. Despite their benefits, existing pre-trained models face challenges like token length limitation, which restricts their usefulness in comprehensive traffic analysis and realistic traffic generation. To address these challenges, we introduce TrafficGPT, a deep learning model that can tackle complex challenges related to long flow classification and generation tasks. This model uses generative pre-training with the linear attention mechanism, which allows for a substantially increased capacity of up to 12,032 tokens from the previous limit of only 512 tokens. TrafficGPT demonstrates superior performance in classification tasks, reaching state-of-the-art levels. In generation tasks, it closely resembles real traffic flows, with low JS divergence and an F1 score close to 0.5 (representing a random guess) in discriminating generated data. These advancements hold promise for future applications in both traffic flow classification and generation tasks.

Title: Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines

Authors: Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.05846
Pdf URL: https://arxiv.org/pdf/2403.05846
Copy Paste: [[2403.05846]] Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines(https://arxiv.org/abs/2403.05846)
Keywords: diffusion
Abstract: Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensive analysis of two recent T2I models. Exploring compound prompts, we find that complex scenes describing multiple objects are composed progressively and more slowly compared to simple scenes; Exploring knowledge retrieval, we find that representation of uncommon concepts requires further computation compared to common concepts, and that knowledge retrieval is gradual across layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines.

Title: LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content

Authors: Qihao Zhao, Yalun Dai, Hao Li, Wei Hu, Fan Zhang, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.05854
Pdf URL: https://arxiv.org/pdf/2403.05854
Copy Paste: [[2403.05854]] LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content(https://arxiv.org/abs/2403.05854)
Keywords: generative
Abstract: Long-tail recognition is challenging because it requires the model to learn good representations from tail categories and address imbalances across all categories. In this paper, we propose a novel generative and fine-tuning framework, LTGC, to handle long-tail recognition via leveraging generated content. Firstly, inspired by the rich implicit knowledge in large-scale models (e.g., large language models, LLMs), LTGC leverages the power of these models to parse and reason over the original tail data to produce diverse tail-class content. We then propose several novel designs for LTGC to ensure the quality of the generated data and to efficiently fine-tune the model using both the generated and original data. The visualization demonstrates the effectiveness of the generation module in LTGC, which produces accurate and diverse tail data. Additionally, the experimental results demonstrate that our LTGC outperforms existing state-of-the-art methods on popular long-tailed benchmarks.

Title: DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

Authors: Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.05895
Pdf URL: https://arxiv.org/pdf/2403.05895
Copy Paste: [[2403.05895]] DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos(https://arxiv.org/abs/2403.05895)
Keywords: self-supervised
Abstract: Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods often treat all objects in a video as static entities, which however violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Depth and motion networks work collaboratively to faithfully model the geometry and dynamics of real-world scenes, which, in turn, benefits both depth and 3D motion estimation. Their predictions are further combined to synthesize a novel video frame for self-supervised training. As a core component of our framework, DO3D is a new motion disentanglement module that learns to predict camera ego-motion and instance-aware 3D object motion separately. To alleviate the difficulties in estimating non-rigid 3D object motions, they are decomposed to object-wise 6-DoF global transformations and a pixel-wise local 3D motion deformation field. Qualitative and quantitative experiments are conducted on three benchmark datasets, including KITTI, Cityscapes, and VKITTI2, where our model delivers superior performance in all evaluated settings. For the depth estimation task, our model outperforms all compared research works in the high-resolution setting, attaining an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark. Besides, our optical flow estimation results (an overall EPE of 7.09 on KITTI) also surpass state-of-the-art methods and largely improve the estimation of dynamic regions, demonstrating the effectiveness of our motion model. Our code will be available.

Title: RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection

Authors: Ximiao Zhang, Min Xu, Xiuzhuang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.05897
Pdf URL: https://arxiv.org/pdf/2403.05897
Copy Paste: [[2403.05897]] RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection(https://arxiv.org/abs/2403.05897)
Keywords: diffusion, self-supervised, anomaly
Abstract: Self-supervised feature reconstruction methods have shown promising advances in industrial image anomaly detection and localization. Despite this progress, these methods still face challenges in synthesizing realistic and diverse anomaly samples, as well as addressing the feature redundancy and pre-training bias of pre-trained feature. In this work, we introduce RealNet, a feature reconstruction network with realistic synthetic anomaly and adaptive feature selection. It is incorporated with three key innovations: First, we propose Strength-controllable Diffusion Anomaly Synthesis (SDAS), a diffusion process-based synthesis strategy capable of generating samples with varying anomaly strengths that mimic the distribution of real anomalous samples. Second, we develop Anomaly-aware Features Selection (AFS), a method for selecting representative and discriminative pre-trained feature subsets to improve anomaly detection performance while controlling computational costs. Third, we introduce Reconstruction Residuals Selection (RRS), a strategy that adaptively selects discriminative residuals for comprehensive identification of anomalous regions across multiple levels of granularity. We assess RealNet on four benchmark datasets, and our results demonstrate significant improvements in both Image AUROC and Pixel AUROC compared to the current state-o-the-art methods. The code, data, and models are available at https://github.com/cnulab/RealNet.

Title: SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to Imbalanced Data

Authors: Ming Zheng, Yang Yang, Zhi-Hang Zhao, Shan-Chao Gan, Yang Chen, Si-Kai Ni, Yang Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.05918
Pdf URL: https://arxiv.org/pdf/2403.05918
Copy Paste: [[2403.05918]] SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to Imbalanced Data(https://arxiv.org/abs/2403.05918)
Keywords: diffusion, generative
Abstract: In the field of data mining and machine learning, commonly used classification models cannot effectively learn in unbalanced data. In order to balance the data distribution before model training,oversamplingmethods are often used to generate data for a small number of classes to solve the problem of classifying unbalanced data. Most of the classical oversampling methods are based on theSMOTE technique, which only focuses on the local information of the data, and therefore the generated data may have the problem of not being realistic enough. In the current oversampling methods based on generative networks, the methods based on GANs can capture the true distribution of data, but there is the problem of pattern collapse and training instability in training; in the oversampling methods based on denoising diffusion probability models, the neural network of the inverse diffusion process using the U-Net is not applicable to tabular data, and although the MLP can be used to replace the U-Net, the problem exists due to the simplicity of the structure and the poor effect of removing noise. problem of poor noise removal. In order to overcome the above problems, we propose a novel oversampling method SEMRes-DDPM.In the SEMRes?DDPM backward diffusion process, a new neural network structure SEMST-ResNet is used, which is suitable for tabular data and has good noise removal effect, and it can generate tabular data with higher quality. Experiments show that the SEMResNet network removes noise better than MLP; SEMRes?DDPM generates data distributions that are closer to the real data distributions than TabDDPM with CWGAN-GP; on 20 real unbalanced tabular datasets with 9 classification models, SEMRes-DDPM improves the quality of the generated tabular data in terms of three evaluation metrics (F1, G-mean, AUC) with better classification performance than other SOTA oversampling methods.

Title: Learned 3D volumetric recovery of clouds and its uncertainty for climate analysis

Authors: Roi Ronen, Ilan Koren, Aviad Levis, Eshkol Eytan, Vadim Holodovsky, Yoav Y. Schechner
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.05932
Pdf URL: https://arxiv.org/pdf/2403.05932
Copy Paste: [[2403.05932]] Learned 3D volumetric recovery of clouds and its uncertainty for climate analysis(https://arxiv.org/abs/2403.05932)
Keywords: self-supervised
Abstract: Significant uncertainty in climate prediction and cloud physics is tied to observational gaps relating to shallow scattered clouds. Addressing these challenges requires remote sensing of their three-dimensional (3D) heterogeneous volumetric scattering content. This calls for passive scattering computed tomography (CT). We design a learning-based model (ProbCT) to achieve CT of such clouds, based on noisy multi-view spaceborne images. ProbCT infers - for the first time - the posterior probability distribution of the heterogeneous extinction coefficient, per 3D location. This yields arbitrary valuable statistics, e.g., the 3D field of the most probable extinction and its uncertainty. ProbCT uses a neural-field representation, making essentially real-time inference. ProbCT undergoes supervised training by a new labeled multi-class database of physics-based volumetric fields of clouds and their corresponding images. To improve out-of-distribution inference, we incorporate self-supervised learning through differential rendering. We demonstrate the approach in simulations and on real-world data, and indicate the relevance of 3D recovery and uncertainty to precipitation and renewable energy.

Title: General surgery vision transformer: A video pre-trained foundation model for general surgery

Authors: Samuel Schmidgall, Ji Woong Kim, Jeffery Jopling, Axel Krieger
Subjects: cs.CV, cs.LG, q-bio.TO
Abstract URL: https://arxiv.org/abs/2403.05949
Pdf URL: https://arxiv.org/pdf/2403.05949
Copy Paste: [[2403.05949]] General surgery vision transformer: A video pre-trained foundation model for general surgery(https://arxiv.org/abs/2403.05949)
Keywords: foundation model
Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to-date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real-time for surgical applications, toward which we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, displaying improved performance over state-of-the-art single frame predictors.

Title: Can Generative Models Improve Self-Supervised Representation Learning?

Authors: Arash Afkanpour, Vahid Reza Khazaie, Sana Ayromlou, Fereshteh Forghani
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.05966
Pdf URL: https://arxiv.org/pdf/2403.05966
Copy Paste: [[2403.05966]] Can Generative Models Improve Self-Supervised Representation Learning?(https://arxiv.org/abs/2403.05966)
Keywords: self-supervised, generative
Abstract: The rapid advancement in self-supervised learning (SSL) has highlighted its potential to leverage unlabeled data for learning powerful visual representations. However, existing SSL approaches, particularly those employing different views of the same image, often rely on a limited set of predefined data augmentations. This constrains the diversity and quality of transformations, which leads to sub-optimal representations. In this paper, we introduce a novel framework that enriches the SSL paradigm by utilizing generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image representation, our method enables the generation of diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for self-supervised learning. Our experimental results demonstrate that our framework significantly enhances the quality of learned visual representations. This research demonstrates that incorporating generative models into the SSL workflow opens new avenues for exploring the potential of unlabeled visual data. This development paves the way for more robust and versatile representation learning techniques.

Title: Few-Shot Cross-Lingual Transfer for Prompting Large Language Models in Low-Resource Languages

Authors: Christopher Toukmaji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06018
Pdf URL: https://arxiv.org/pdf/2403.06018
Copy Paste: [[2403.06018]] Few-Shot Cross-Lingual Transfer for Prompting Large Language Models in Low-Resource Languages(https://arxiv.org/abs/2403.06018)
Keywords: in-context
Abstract: Large pre-trained language models (PLMs) are at the forefront of advances in Natural Language Processing. One widespread use case of PLMs is "prompting" - or in-context learning - where a user provides a description of a task and some completed examples of the task to a PLM as context before prompting the PLM to perform the task on a new example. Only the largest, most capable PLMs are able to perform in-context learning effectively, and these models are typically trained with a predominantly English corpus, leaving all other languages behind. The data limitations in most languages preclude the training of language-specific PLMs capable of prompting. Albeit the surge in work of prompting settings, it is still unclear how PLMs should be adapted cross-lingually specifically for prompting. We evaluate the possible methods to adapt LLaMa, a 7B parameter open-source PLM mainly trained in English, for prompting in low-resource languages, namely for Kinyarwanda, Hausa, and Luganda. We consider three methods: few-shot prompting (prompt), language-adaptive fine-tuning (LAFT), and neural machine translation (translate), and evaluate on abstractive summarization, multi-class topic classification, and named-entity recognition. Although LAFT carries the greatest compute cost and intuitively should lead to the best results, our experiments exhibit that LAFT is only occasionally the optimal choice for adapting PLMs for prompting. Rather, the translate and prompt settings are a compute-efficient and cost-effective method of few-shot prompting for the selected low-resource languages. We find that the results are task and language dependent but find that the prompting method is the best on average across all tasks and languages. Results show that the prompt setting performs better than both translating and LAFT with statistical significance for all shots when aggregated across all tasks and languages.

Title: Multi-conditioned Graph Diffusion for Neural Architecture Search

Authors: Rohan Asthana, Joschua Conrad, Youssef Dawoud, Maurits Ortmanns, Vasileios Belagiannis
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.06020
Pdf URL: https://arxiv.org/pdf/2403.06020
Copy Paste: [[2403.06020]] Multi-conditioned Graph Diffusion for Neural Architecture Search(https://arxiv.org/abs/2403.06020)
Keywords: diffusion
Abstract: Neural architecture search automates the design of neural network architectures usually by exploring a large and thus complex architecture search space. To advance the architecture search, we present a graph diffusion-based NAS approach that uses discrete conditional graph diffusion processes to generate high-performing neural network architectures. We then propose a multi-conditioned classifier-free guidance approach applied to graph diffusion networks to jointly impose constraints such as high accuracy and low hardware latency. Unlike the related work, our method is completely differentiable and requires only a single model training. In our evaluations, we show promising results on six standard benchmarks, yielding novel and unique architectures at a fast speed, i.e. less than 0.2 seconds per architecture. Furthermore, we demonstrate the generalisability and efficiency of our method through experiments on ImageNet dataset.

Title: Reframe Anything: LLM Agent for Open World Video Reframing

Authors: Jiawang Cao, Yongliang Wu, Weiheng Chi, Wenbo Zhu, Ziyue Su, Jay Wu
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2403.06070
Pdf URL: https://arxiv.org/pdf/2403.06070
Copy Paste: [[2403.06070]] Reframe Anything: LLM Agent for Open World Video Reframing(https://arxiv.org/abs/2403.06070)
Keywords: foundation model
Abstract: The proliferation of mobile devices and social media has revolutionized content dissemination, with short-form video becoming increasingly prevalent. This shift has introduced the challenge of video reframing to fit various screen aspect ratios, a process that highlights the most compelling parts of a video. Traditionally, video reframing is a manual, time-consuming task requiring professional expertise, which incurs high production costs. A potential solution is to adopt some machine learning models, such as video salient object detection, to automate the process. However, these methods often lack generalizability due to their reliance on specific training data. The advent of powerful large language models (LLMs) open new avenues for AI capabilities. Building on this, we introduce Reframe Any Video Agent (RAVA), a LLM-based agent that leverages visual foundation models and human instructions to restructure visual content for video reframing. RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video. Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.

Title: FrameQuant: Flexible Low-Bit Quantization for Transformers

Authors: Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.06082
Pdf URL: https://arxiv.org/pdf/2403.06082
Copy Paste: [[2403.06082]] FrameQuant: Flexible Low-Bit Quantization for Transformers(https://arxiv.org/abs/2403.06082)
Keywords: foundation model
Abstract: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so, serving such models is expensive often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.

Title: Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation Models

Authors: Esmaeil Seraj, Walter Talamonti
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2403.06088
Pdf URL: https://arxiv.org/pdf/2403.06088
Copy Paste: [[2403.06088]] Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation Models(https://arxiv.org/abs/2403.06088)
Keywords: foundation model
Abstract: In the burgeoning field of intelligent transportation systems, enhancing vehicle-driver interaction through facial attribute recognition, such as facial expression, eye gaze, age, etc., is of paramount importance for safety, personalization, and overall user experience. However, the scarcity of comprehensive large-scale, real-world datasets poses a significant challenge for training robust multi-task models. Existing literature often overlooks the potential of synthetic datasets and the comparative efficacy of state-of-the-art vision foundation models in such constrained settings. This paper addresses these gaps by investigating the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of passengers of a vehicle, such as gaze plane, age, and facial expression. Utilizing transfer learning techniques with both pre-trained Vision Transformer (ViT) and Residual Network (ResNet) models, we explore various training and adaptation methods to optimize performance, particularly when data availability is limited. We provide extensive post-evaluation analysis, investigating the effects of synthetic data distributions on model performance in in-distribution data and out-of-distribution inference. Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViTs in our specific multi-task context, which is attributed to the mismatch in model complexity relative to task complexity. Our results highlight the challenges and opportunities for enhancing the use of synthetic data and vision foundation models in practical applications.

Title: Diffusion Models Trained with Large Data Are Transferable Visual Models

Authors: Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06090
Pdf URL: https://arxiv.org/pdf/2403.06090
Copy Paste: [[2403.06090]] Diffusion Models Trained with Large Data Are Transferable Visual Models(https://arxiv.org/abs/2403.06090)
Keywords: diffusion
Abstract: We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data (even synthetic data only), including monocular depth, surface normal, image segmentation, matting, human pose estimation, among virtually many others. Previous works have adapted diffusion models for various perception tasks, often reformulating these tasks as generation processes to align with the diffusion process. In sharp contrast, we demonstrate that fine-tuning these models with minimal adjustments can be a more effective alternative, offering the advantages of being embarrassingly simple and significantly faster. As the backbone network of Stable Diffusion models is trained on giant datasets comprising billions of images, we observe very robust generalization capabilities of the diffusion backbone. Experimental results showcase the remarkable transferability of the backbone of diffusion models across diverse tasks and real-world datasets.

Title: VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Authors: Wenhao Wang, Yi Yang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.06098
Pdf URL: https://arxiv.org/pdf/2403.06098
Copy Paste: [[2403.06098]] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models(https://arxiv.org/abs/2403.06098)
Keywords: diffusion
Abstract: The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, as well as other text-to-video diffusion models, highly relies on the prompts, and there is no publicly available dataset featuring a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users. Additionally, the dataset includes 6.69 million videos generated by four state-of-the-art diffusion models and some related data. We initially demonstrate the curation of this large-scale dataset, which is a time-consuming and costly process. Subsequently, we show how the proposed VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Based on the analysis of these prompts, we identify the necessity for a new prompt dataset specifically designed for text-to-video generation and gain insights into the preferences of real users when creating videos. Our large-scale and diverse dataset also inspires many exciting new research areas. For instance, to develop better, more efficient, and safer text-to-video diffusion models, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models. We make the collected dataset VidProM publicly available at GitHub and Hugging Face under the CC-BY- NC 4.0 License.

Title: Coherent Temporal Synthesis for Incremental Action Segmentation

Authors: Guodong Ding, Hans Golong, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06102
Pdf URL: https://arxiv.org/pdf/2403.06102
Copy Paste: [[2403.06102]] Coherent Temporal Synthesis for Incremental Action Segmentation(https://arxiv.org/abs/2403.06102)
Keywords: generative
Abstract: Data replay is a successful incremental learning technique for images. It prevents catastrophic forgetting by keeping a reservoir of previous data, original or synthesized, to ensure the model retains past knowledge while adapting to novel concepts. However, its application in the video domain is rudimentary, as it simply stores frame exemplars for action recognition. This paper presents the first exploration of video data replay techniques for incremental action segmentation, focusing on action temporal modeling. We propose a Temporally Coherent Action (TCA) model, which represents actions using a generative model instead of storing individual frames. The integration of a conditioning variable that captures temporal coherence allows our model to understand the evolution of action features over time. Therefore, action segments generated by TCA for replay are diverse and temporally coherent. In a 10-task incremental setup on the Breakfast dataset, our approach achieves significant increases in accuracy for up to 22% compared to the baselines.

Title: Universal Debiased Editing for Fair Medical Image Classification

Authors: Ruinan Jin, Wenlong Deng, Minghui Chen, Xiaoxiao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06104
Pdf URL: https://arxiv.org/pdf/2403.06104
Copy Paste: [[2403.06104]] Universal Debiased Editing for Fair Medical Image Classification(https://arxiv.org/abs/2403.06104)
Keywords: foundation model
Abstract: In the era of Foundation Models' (FMs) rising prominence in AI, our study addresses the challenge of biases in medical images while using FM API, particularly spurious correlations between pixels and sensitive attributes. Traditional methods for bias mitigation face limitations due to the restricted access to web-hosted FMs and difficulties in addressing the underlying bias encoded within the FM API. We propose an U(niversal) D(ebiased) E(diting) strategy, termed UDE, which generates UDE noise to mask such spurious correlation. UDE is capable of mitigating bias both within the FM API embedding and the images themselves. Furthermore, UDE is suitable for both white-box and black-box FM APIs, where we introduced G(reedy) (Z)eroth-O(rder) (GeZO) optimization for it when the gradient is inaccessible in black-box APIs. Our whole pipeline enables fairness-aware image editing that can be applied across various medical contexts without requiring direct model manipulation or significant computational resources. Our empirical results demonstrate the method's effectiveness in maintaining fairness and utility across different patient groups and diseases. In the era of AI-driven medicine, this work contributes to making healthcare diagnostics more equitable, showcasing a practical solution for bias mitigation in pre-trained image FMs.

Title: In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Authors: Junhui Yin, Xinyu Zhang, Lin Wu, Xianghua Xie, Xiaojie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06126
Pdf URL: https://arxiv.org/pdf/2403.06126
Copy Paste: [[2403.06126]] In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model(https://arxiv.org/abs/2403.06126)
Keywords: in-context
Abstract: Existing pre-trained vision-language models, e.g., CLIP, have demonstrated impressive zero-shot generalization capabilities in various downstream tasks. However, the performance of these models will degrade significantly when test inputs present different distributions. To this end, we explore the concept of test-time prompt tuning (TTPT), which enables the adaptation of the CLIP model to novel downstream tasks through only one step of optimization on an unsupervised objective that involves the test sample. Motivated by in-context learning within field of natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition task. InCPL involves associating a new test sample with very few or even just one labeled example as its in-context prompt. As a result, it can reliably estimate a label for the test sample, thereby facilitating the model adaptation process. InCPL first employs a token net to represent language descriptions as visual prompts that the vision encoder of a CLIP model can comprehend. Paired with in-context examples, we further propose a context-aware unsupervised loss to optimize test sample-aware visual prompts. This optimization allows a pre-trained, frozen CLIP model to be adapted to a test sample from any task using its learned adaptive prompt. Our method has demonstrated superior performance and achieved state-of-the-art results across various downstream datasets.

Title: FedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning

Authors: Zhuo Zhang, Jingyuan Zhang, Jintao Huang, Lizhen Qu, Hongzhi Zhang, Zenglin Xu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06131
Pdf URL: https://arxiv.org/pdf/2403.06131
Copy Paste: [[2403.06131]] FedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning(https://arxiv.org/abs/2403.06131)
Keywords: in-context
Abstract: Instruction tuning has proven essential for enhancing the performance of large language models (LLMs) in generating human-aligned responses. However, collecting diverse, high-quality instruction data for tuning poses challenges, particularly in privacy-sensitive domains. Federated instruction tuning (FedIT) has emerged as a solution, leveraging federated learning from multiple data owners while preserving privacy. Yet, it faces challenges due to limited instruction data and vulnerabilities to training data extraction attacks. To address these issues, we propose a novel federated algorithm, FedPIT, which utilizes LLMs' in-context learning capability to self-generate task-specific synthetic data for training autonomously. Our method employs parameter-isolated training to maintain global parameters trained on synthetic data and local parameters trained on augmented local data, effectively thwarting data extraction attacks. Extensive experiments on real-world medical data demonstrate the effectiveness of FedPIT in improving federated few-shot performance while preserving privacy and robustness against data heterogeneity.

Title: MACE: Mass Concept Erasure in Diffusion Models

Authors: Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, Adams Wai-Kin Kong
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06135
Pdf URL: https://arxiv.org/pdf/2403.06135
Copy Paste: [[2403.06135]] MACE: Mass Concept Erasure in Diffusion Models(https://arxiv.org/abs/2403.06135)
Keywords: diffusion
Abstract: The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at https://github.com/Shilin-LU/MACE.

Title: RESTORE: Towards Feature Shift for Vision-Language Prompt Learning

Authors: Yuncheng Yang, Chuyan Zhang, Zuopeng Yang, Yuting Gao, Yulei Qin, Ke Li, Xing Sun, Jie Yang, Yun Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06136
Pdf URL: https://arxiv.org/pdf/2403.06136
Copy Paste: [[2403.06136]] RESTORE: Towards Feature Shift for Vision-Language Prompt Learning(https://arxiv.org/abs/2403.06136)
Keywords: foundation model
Abstract: Prompt learning is effective for fine-tuning foundation models to improve their generalization across a variety of downstream tasks. However, the prompts that are independently optimized along a single modality path, may sacrifice the vision-language alignment of pre-trained models in return for improved performance on specific tasks and classes, leading to poorer generalization. In this paper, we first demonstrate that prompt tuning along only one single branch of CLIP (e.g., language or vision) is the reason why the misalignment occurs. Without proper regularization across the learnable parameters in different modalities, prompt learning violates the original pre-training constraints inherent in the two-tower architecture. To address such misalignment, we first propose feature shift, which is defined as the variation of embeddings after introducing the learned prompts, to serve as an explanatory tool. We dive into its relation with generalizability and thereafter propose RESTORE, a multi-modal prompt learning method that exerts explicit constraints on cross-modal consistency. To be more specific, to prevent feature misalignment, a feature shift consistency is introduced to synchronize inter-modal feature shifts by measuring and regularizing the magnitude of discrepancy during prompt tuning. In addition, we propose a "surgery" block to avoid short-cut hacking, where cross-modal misalignment can still be severe if the feature shift of each modality varies drastically at the same rate. It is implemented as feed-forward adapters upon both modalities to alleviate the misalignment problem. Extensive experiments on 15 datasets demonstrate that our method outperforms the state-of-the-art prompt tuning methods without compromising feature alignment.

Title: Bayesian Random Semantic Data Augmentation for Medical Image Classification

Authors: Yaoyao Zhu, Xiuding Cai, Xueyao Wang, Yu Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06138
Pdf URL: https://arxiv.org/pdf/2403.06138
Copy Paste: [[2403.06138]] Bayesian Random Semantic Data Augmentation for Medical Image Classification(https://arxiv.org/abs/2403.06138)
Keywords: generative
Abstract: Data augmentation is a critical regularization technique for deep neural networks, particularly in medical image classification. Popular data augmentation approaches include image transformation-based methods, generative data augmentation, and automatic data augmentation. However, these approaches encounter notable limitations: image transformation-based and automated data augmentation techniques cannot implement semantic transformations, leading to a constrained variety of augmented samples, and generative data augmentation methods are computationally expensive. In response to these challenges, we proposed Bayesian Random Semantic Data Augmentation (BRSDA), a novel, efficient, and plug-and-play semantic data augmentation method. BRSDA is motivated by a simple translation in the feature space along specific directions that can effectuate semantic transformations. When given a feature, we define its augmentable semantic magnitude as a random variable and estimate its distribution using variational Bayesian, then sample semantic magnitude and add to the randomly selected semantic direction to achieve semantic data augmentation. We demonstrate the effectiveness of BRSDA on five 2D and six 3D medical image datasets covering nine modalities. We also test BRSDA with mainstream neural network architectures, showcasing its robustness. Furthermore, combining BRSDA with other leading data augmentation methods achieves superior performance. Code is available online at \url{https://github.com/YaoyaoZhu19/BRSDA}.

Title: All-in-one platform for AI R&D in medical imaging, encompassing data collection, selection, annotation, and pre-processing

Authors: Changhee Han, Kyohei Shibano, Wataru Ozaki, Keishiro Osaki, Takafumi Haraguchi, Daisuke Hirahara, Shumon Kimura, Yasuyuki Kobayashi, Gento Mogi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06145
Pdf URL: https://arxiv.org/pdf/2403.06145
Copy Paste: [[2403.06145]] All-in-one platform for AI R&D in medical imaging, encompassing data collection, selection, annotation, and pre-processing(https://arxiv.org/abs/2403.06145)
Keywords: generative
Abstract: Deep Learning is advancing medical imaging Research and Development (R&D), leading to the frequent clinical use of Artificial Intelligence/Machine Learning (AI/ML)-based medical devices. However, to advance AI R&D, two challenges arise: 1) significant data imbalance, with most data from Europe/America and under 10% from Asia, despite its 60% global population share; and 2) hefty time and investment needed to curate proprietary datasets for commercial use. In response, we established the first commercial medical imaging platform, encompassing steps like: 1) data collection, 2) data selection, 3) annotation, and 4) pre-processing. Moreover, we focus on harnessing under-represented data from Japan and broader Asia, including Computed Tomography, Magnetic Resonance Imaging, and Whole Slide Imaging scans. Using the collected data, we are preparing/providing ready-to-use datasets for medical AI R&D by 1) offering these datasets to AI firms, biopharma, and medical device makers and 2) using them as training/test data to develop tailored AI solutions for such entities. We also aim to merge Blockchain for data security and plan to synthesize rare disease data via generative AI. DataHub Website: https://medical-datahub.ai/

Title: GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection

Authors: Huaxin Zhang, Xiang Wang, Xiaohao Xu, Xiaonan Huang, Chuchu Han, Yuehuan Wang, Changxin Gao, Shanjun Zhang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06154
Pdf URL: https://arxiv.org/pdf/2403.06154
Copy Paste: [[2403.06154]] GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection(https://arxiv.org/abs/2403.06154)
Keywords: anomaly
Abstract: In recent years, video anomaly detection has been extensively investigated in both unsupervised and weakly supervised settings to alleviate costly temporal labeling. Despite significant progress, these methods still suffer from unsatisfactory results such as numerous false alarms, primarily due to the absence of precise temporal anomaly annotation. In this paper, we present a novel labeling paradigm, termed "glance annotation", to achieve a better balance between anomaly detection accuracy and annotation cost. Specifically, glance annotation is a random frame within each abnormal event, which can be easily accessed and is cost-effective. To assess its effectiveness, we manually annotate the glance annotations for two standard video anomaly detection datasets: UCF-Crime and XD-Violence. Additionally, we propose a customized GlanceVAD method, that leverages gaussian kernels as the basic unit to compose the temporal anomaly distribution, enabling the learning of diverse and robust anomaly representations from the glance annotations. Through comprehensive analysis and experiments, we verify that the proposed labeling paradigm can achieve an excellent trade-off between annotation cost and model performance. Extensive experimental results also demonstrate the effectiveness of our GlanceVAD approach, which significantly outperforms existing advanced unsupervised and weakly supervised methods. Code and annotations will be publicly available at https://github.com/pipixin321/GlanceVAD.

Title: Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation

Authors: Paweł A. Pierzchlewicz, Caio da Silva, R. James Cotton, Fabian H. Sinz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06164
Pdf URL: https://arxiv.org/pdf/2403.06164
Copy Paste: [[2403.06164]] Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation(https://arxiv.org/abs/2403.06164)
Keywords: diffusion
Abstract: Single camera 3D pose estimation is an ill-defined problem due to inherent ambiguities from depth, occlusion or keypoint noise. Multi-hypothesis pose estimation accounts for this uncertainty by providing multiple 3D poses consistent with the 2D measurements. Current research has predominantly concentrated on generating multiple hypotheses for single frame static pose estimation. In this study we focus on the new task of multi-hypothesis motion estimation. Motion estimation is not simply pose estimation applied to multiple frames, which would ignore temporal correlation across frames. Instead, it requires distributions which are capable of generating temporally consistent samples, which is significantly more challenging. To this end, we introduce Platypose, a framework that uses a diffusion model pretrained on 3D human motion sequences for zero-shot 3D pose sequence estimation. Platypose outperforms baseline methods on multiple hypotheses for motion estimation. Additionally, Platypose also achieves state-of-the-art calibration and competitive joint error when tested on static poses from Human3.6M, MPI-INF-3DHP and 3DPW. Finally, because it is zero-shot, our method generalizes flexibly to different settings such as multi-camera inference.

Title: DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

Authors: Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jinlong Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin, Chengjie Wang, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06168
Pdf URL: https://arxiv.org/pdf/2403.06168
Copy Paste: [[2403.06168]] DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation(https://arxiv.org/abs/2403.06168)
Keywords: diffusion
Abstract: Due to the difficulty and labor-consuming nature of getting highly accurate or matting annotations, there only exists a limited amount of highly accurate labels available to the public. To tackle this challenge, we propose a DiffuMatting which inherits the strong Everything generation ability of diffusion and endows the power of "matting anything". Our DiffuMatting can 1). act as an anything matting factory with high accurate annotations 2). be well-compatible with community LoRAs or various conditional control approaches to achieve the community-friendly art design and controllable generation. Specifically, inspired by green-screen-matting, we aim to teach the diffusion model to paint on a fixed green screen canvas. To this end, a large-scale greenscreen dataset (Green100K) is collected as a training dataset for DiffuMatting. Secondly, a green background control loss is proposed to keep the drawing board as a pure green color to distinguish the foreground and background. To ensure the synthesized object has more edge details, a detailed-enhancement of transition boundary loss is proposed as a guideline to generate objects with more complicated edge structures. Aiming to simultaneously generate the object and its matting annotation, we build a matting head to make a green color removal in the latent space of the VAE decoder. Our DiffuMatting shows several potential applications (e.g., matting-data generator, community-friendly art design and controllable generation). As a matting-data generator, DiffuMatting synthesizes general object and portrait matting sets, effectively reducing the relative MSE error by 15.4% in General Object Matting and 11.4% in Portrait Matting tasks.

Title: An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling

Authors: Xunpeng Huang, Hanze Dong, Difan Zou, Tong Zhang
Subjects: cs.LG, math.OC, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2403.06183
Pdf URL: https://arxiv.org/pdf/2403.06183
Copy Paste: [[2403.06183]] An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling(https://arxiv.org/abs/2403.06183)
Keywords: diffusion
Abstract: Understanding the dimension dependency of computational complexity in high-dimensional sampling problem is a fundamental problem, both from a practical and theoretical perspective. Compared with samplers with unbiased stationary distribution, e.g., Metropolis-adjusted Langevin algorithm (MALA), biased samplers, e.g., Underdamped Langevin Dynamics (ULD), perform better in low-accuracy cases just because a lower dimension dependency in their complexities. Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion is able to converge dimension independently for strongly log-concave target distributions. Nonetheless, it remains open whether such property establishes for more general cases. In this paper, we investigate the prior diffusion technique for the target distributions satisfying log-Sobolev inequality (LSI), which covers a much broader class of distributions compared to the strongly log-concave ones. In particular, we prove that the modified Langevin algorithm can also obtain the dimension-independent convergence of KL divergence with different step size schedules. The core of our proof technique is a novel construction of an interpolating SDE, which significantly helps to conduct a more accurate characterization of the discrete updates of the overdamped Langevin dynamics. Our theoretical analysis demonstrates the benefits of prior diffusion for a broader class of target distributions and provides new insights into developing faster sampling algorithms.

Title: Harmonious Group Choreography with Trajectory-Controllable Diffusion

Authors: Yuqin Dai, Wanlu Zhu, Ronghui Li, Zeping Ren, Xiangzheng Zhou, Xiu Li, Jun Li, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06189
Pdf URL: https://arxiv.org/pdf/2403.06189
Copy Paste: [[2403.06189]] Harmonious Group Choreography with Trajectory-Controllable Diffusion(https://arxiv.org/abs/2403.06189)
Keywords: diffusion
Abstract: Creating group choreography from music has gained attention in cultural entertainment and virtual reality, aiming to coordinate visually cohesive and diverse group movements. Despite increasing interest, recent works face challenges in achieving aesthetically appealing choreography, primarily for two key issues: multi-dancer collision and single-dancer foot slide. To address these issues, we propose a Trajectory-Controllable Diffusion (TCDiff), a novel approach that harnesses non-overlapping trajectories to facilitate coherent dance movements. Specifically, to tackle dancer collisions, we introduce a Dance-Beat Navigator capable of generating trajectories for multiple dancers based on the music, complemented by a Distance-Consistency loss to maintain appropriate spacing among trajectories within a reasonable threshold. To mitigate foot sliding, we present a Footwork Adaptor that utilizes trajectory displacement from adjacent frames to enable flexible footwork, coupled with a Relative Forward-Kinematic loss to adjust the positioning of individual dancers' root nodes and joints. Extensive experiments demonstrate that our method achieves state-of-the-art results.

Title: On depth prediction for autonomous driving using self-supervised learning

Authors: Houssem Boulahbal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06194
Pdf URL: https://arxiv.org/pdf/2403.06194
Copy Paste: [[2403.06194]] On depth prediction for autonomous driving using self-supervised learning(https://arxiv.org/abs/2403.06194)
Keywords: self-supervised, generative
Abstract: Perception of the environment is a critical component for enabling autonomous driving. It provides the vehicle with the ability to comprehend its surroundings and make informed decisions. Depth prediction plays a pivotal role in this process, as it helps the understanding of the geometry and motion of the environment. This thesis focuses on the challenge of depth prediction using monocular self-supervised learning techniques. The problem is approached from a broader perspective first, exploring conditional generative adversarial networks (cGANs) as a potential technique to achieve better generalization was performed. In doing so, a fundamental contribution to the conditional GANs, the acontrario cGAN was proposed. The second contribution entails a single image-to-depth self-supervised method, proposing a solution for the rigid-scene assumption using a novel transformer-based method that outputs a pose for each dynamic object. The third significant aspect involves the introduction of a video-to-depth map forecasting approach. This method serves as an extension of self-supervised techniques to predict future depths. This involves the creation of a novel transformer model capable of predicting the future depth of a given scene. Moreover, the various limitations of the aforementioned methods were addressed and a video-to-video depth maps model was proposed. This model leverages the spatio-temporal consistency of the input and output sequence to predict a more accurate depth sequence output. These methods have significant applications in autonomous driving (AD) and advanced driver assistance systems (ADAS).

Title: Cooperative Classification and Rationalization for Graph Generalization

Authors: Linan Yue, Qi Liu, Ye Liu, Weibo Gao, Fangzhou Yao, Wenfeng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06239
Pdf URL: https://arxiv.org/pdf/2403.06239
Copy Paste: [[2403.06239]] Cooperative Classification and Rationalization for Graph Generalization(https://arxiv.org/abs/2403.06239)
Keywords: generative
Abstract: Graph Neural Networks (GNNs) have achieved impressive results in graph classification tasks, but they struggle to generalize effectively when faced with out-of-distribution (OOD) data. Several approaches have been proposed to address this problem. Among them, one solution is to diversify training distributions in vanilla classification by modifying the data environment, yet accessing the environment information is complex. Besides, another promising approach involves rationalization, extracting invariant rationales for predictions. However, extracting rationales is difficult due to limited learning signals, resulting in less accurate rationales and diminished predictions. To address these challenges, in this paper, we propose a Cooperative Classification and Rationalization (C2R) method, consisting of the classification and the rationalization module. Specifically, we first assume that multiple environments are available in the classification module. Then, we introduce diverse training distributions using an environment-conditional generative network, enabling robust graph representations. Meanwhile, the rationalization module employs a separator to identify relevant rationale subgraphs while the remaining non-rationale subgraphs are de-correlated with labels. Next, we align graph representations from the classification module with rationale subgraph representations using the knowledge distillation methods, enhancing the learning signal for rationales. Finally, we infer multiple environments by gathering non-rationale representations and incorporate them into the classification module for cooperative learning. Extensive experimental results on both benchmarks and synthetic datasets demonstrate the effectiveness of C2R. Code is available at https://github.com/yuelinan/Codes-of-C2R.

Title: Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation

Authors: Mingyu Lee, Jongwon Choi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06247
Pdf URL: https://arxiv.org/pdf/2403.06247
Copy Paste: [[2403.06247]] Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation(https://arxiv.org/abs/2403.06247)
Keywords: anomaly
Abstract: We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.

Title: SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations

Authors: Amit Meghanani, Thomas Hain
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.06260
Pdf URL: https://arxiv.org/pdf/2403.06260
Copy Paste: [[2403.06260]] SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations(https://arxiv.org/abs/2403.06260)
Keywords: self-supervised
Abstract: There is a growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are used for robust performance on various downstream tasks by fine-tuning on the labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy, aiming to learn similar representations from perturbed speech and original speech. Commonly used data augmentation techniques for content-related tasks (ASR) are applied to obtain perturbed speech. SCORE fine-tuned HuBERT outperforms the vanilla HuBERT on SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks, with relative improvements of 1.09%, 3.58%, and 12.65%, respectively. SCORE provides competitive results with the recently proposed SSFT method SPIN, using only 1/3 of the processed speech compared to SPIN.

Title: FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Authors: Youyuan Zhang, Xuan Ju, James J. Clark
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06269
Pdf URL: https://arxiv.org/pdf/2403.06269
Copy Paste: [[2403.06269]] FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing(https://arxiv.org/abs/2403.06269)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule. This results in improved speed advantages, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment.

Title: An End-to-End Deep Learning Generative Framework for Refinable Shape Matching and Generation

Authors: Soodeh Kalaie, Andy Bulpitt, Alejandro F. Frangi, Ali Gooya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06317
Pdf URL: https://arxiv.org/pdf/2403.06317
Copy Paste: [[2403.06317]] An End-to-End Deep Learning Generative Framework for Refinable Shape Matching and Generation(https://arxiv.org/abs/2403.06317)
Keywords: generative
Abstract: Generative modelling for shapes is a prerequisite for In-Silico Clinical Trials (ISCTs), which aim to cost-effectively validate medical device interventions using synthetic anatomical shapes, often represented as 3D surface meshes. However, constructing AI models to generate shapes closely resembling the real mesh samples is challenging due to variable vertex counts, connectivities, and the lack of dense vertex-wise correspondences across the training data. Employing graph representations for meshes, we develop a novel unsupervised geometric deep-learning model to establish refinable shape correspondences in a latent space, construct a population-derived atlas and generate realistic synthetic shapes. We additionally extend our proposed base model to a joint shape generative-clustering multi-atlas framework to incorporate further variability and preserve more details in the generated shapes. Experimental results using liver and left-ventricular models demonstrate the approach's applicability to computational medicine, highlighting its suitability for ISCTs through a comparative analysis.

Title: Fake or Compromised? Making Sense of Malicious Clients in Federated Learning

Authors: Hamid Mozaffari, Sunav Choudhary, Amir Houmansadr
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2403.06319
Pdf URL: https://arxiv.org/pdf/2403.06319
Copy Paste: [[2403.06319]] Fake or Compromised? Making Sense of Malicious Clients in Federated Learning(https://arxiv.org/abs/2403.06319)
Keywords: generative
Abstract: Federated learning (FL) is a distributed machine learning paradigm that enables training models on decentralized data. The field of FL security against poisoning attacks is plagued with confusion due to the proliferation of research that makes different assumptions about the capabilities of adversaries and the adversary models they operate under. Our work aims to clarify this confusion by presenting a comprehensive analysis of the various poisoning attacks and defensive aggregation rules (AGRs) proposed in the literature, and connecting them under a common framework. To connect existing adversary models, we present a hybrid adversary model, which lies in the middle of the spectrum of adversaries, where the adversary compromises a few clients, trains a generative (e.g., DDPM) model with their compromised samples, and generates new synthetic data to solve an optimization for a stronger (e.g., cheaper, more practical) attack against different robust aggregation rules. By presenting the spectrum of FL adversaries, we aim to provide practitioners and researchers with a clear understanding of the different types of threats they need to consider when designing FL systems, and identify areas where further research is needed.

Title: Transferable Reinforcement Learning via Generalized Occupancy Models

Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.06328
Pdf URL: https://arxiv.org/pdf/2403.06328
Copy Paste: [[2403.06328]] Transferable Reinforcement Learning via Generalized Occupancy Models(https://arxiv.org/abs/2403.06328)
Keywords: diffusion
Abstract: Intelligent agents must be generalists - showing the ability to quickly adapt and generalize to varying tasks. Within the framework of reinforcement learning (RL), model-based RL algorithms learn a task-agnostic dynamics model of the world, in principle allowing them to generalize to arbitrary rewards. However, one-step models naturally suffer from compounding errors, making them ineffective for problems with long horizons and large state spaces. In this work, we propose a novel class of models - generalized occupancy models (GOMs) - that retain the generality of model-based RL while avoiding compounding error. The key idea behind GOMs is to model the distribution of all possible long-term outcomes from a given state under the coverage of a stationary dataset, along with a policy that realizes a particular outcome from the given state. These models can then quickly be used to select the optimal action for arbitrary new tasks, without having to redo policy optimization. By directly modeling long-term outcomes, GOMs avoid compounding error while retaining generality across arbitrary reward functions. We provide a practical instantiation of GOMs using diffusion models and show its efficacy as a new class of transferable models, both theoretically and empirically across a variety of simulated robotics problems. Videos and code at https://weirdlabuw.github.io/gom/.

Title: Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Authors: Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06351
Pdf URL: https://arxiv.org/pdf/2403.06351
Copy Paste: [[2403.06351]] Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos(https://arxiv.org/abs/2403.06351)
Keywords: diffusion, generative
Abstract: We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation, which explicitly encourages cross-view correspondence between exocentric and egocentric views, and a diffusion-based pixel-level hallucination, which incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To pave the way for future advancements in this field, we curate a comprehensive exo-to-ego cross-view translation benchmark. It consists of a diverse collection of synchronized ego-exo tabletop activity video pairs sourced from three public datasets: H2O, Aria Pilot, and Assembly101. The experimental results validate that Exo2Ego delivers photorealistic video results with clear hand manipulation details and outperforms several baselines in terms of both synthesis quality and generalization ability to new actions.

Title: Say Anything with Any Style

Authors: Shuai Tan, Bin Ji, Yu Ding, Ye Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06363
Pdf URL: https://arxiv.org/pdf/2403.06363
Copy Paste: [[2403.06363]] Say Anything with Any Style(https://arxiv.org/abs/2403.06363)
Keywords: generative
Abstract: Generating stylized talking head with diverse head motions is crucial for achieving natural-looking videos but still remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles which causes suboptimal performance. To address these, we propose a novel dynamic-weight method, namely Say Anything withAny Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances the precision and robustness when extracting the speaking styles of the given style clips. By utilizing the extracted style, a residual architecture comprising a canonical branch and style-specific branch is employed to predict the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we steer clear of employing a universal network by exploring an elaborate HyperStyle to produce the style-specific weights offset for the style branch. Furthermore, we construct a pose generator and a pose codebook to store the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-theart methods in terms of both lip-synchronization and stylized expression. Besides, we extend our SAAS to video-driven style editing field and achieve satisfactory performance.

Title: Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

Authors: Shuai Tan, Bin Ji, Ye Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06365
Pdf URL: https://arxiv.org/pdf/2403.06365
Copy Paste: [[2403.06365]] Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style(https://arxiv.org/abs/2403.06365)
Keywords: diffusion
Abstract: Although automatically animating audio-driven talking heads has recently received growing interest, previous efforts have mainly concentrated on achieving lip synchronization with the audio, neglecting two crucial elements for generating expressive videos: emotion style and art style. In this paper, we present an innovative audio-driven talking face generation method called Style2Talker. It involves two stylized stages, namely Style-E and Style-A, which integrate text-controlled emotion style and picture-controlled art style into the final output. In order to prepare the scarce emotional text descriptions corresponding to the videos, we propose a labor-free paradigm that employs large-scale pretrained models to automatically annotate emotional text labels for existing audiovisual datasets. Incorporating the synthetic emotion texts, the Style-E stage utilizes a large-scale CLIP model to extract emotion representations, which are combined with the audio, serving as the condition for an efficient latent diffusion model designed to produce emotional motion coefficients of a 3DMM model. Moving on to the Style-A stage, we develop a coefficient-driven motion generator and an art-specific style path embedded in the well-known StyleGAN. This allows us to synthesize high-resolution artistically stylized talking head videos using the generated emotional motion coefficients and an art style source picture. Moreover, to better preserve image details and avoid artifacts, we provide StyleGAN with the multi-scale content features extracted from the identity image and refine its intermediate feature maps by the designed content encoder and refinement network, respectively. Extensive experimental results demonstrate our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style.

Title: Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Authors: Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06381
Pdf URL: https://arxiv.org/pdf/2403.06381
Copy Paste: [[2403.06381]] Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models(https://arxiv.org/abs/2403.06381)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have notably improved the perceptual quality of generated images in text-to-image synthesis tasks. However, diffusion models often struggle to produce images that accurately reflect the intended semantics of the associated text prompts. We examine cross-attention layers in diffusion models and observe a propensity for these layers to disproportionately focus on certain tokens during the generation process, thereby undermining semantic fidelity. To address the issue of dominant attention, we introduce attention regulation, a computation-efficient on-the-fly optimization approach at inference time to align attention maps with the input text prompt. Notably, our method requires no additional training or fine-tuning and serves as a plug-in module on a model. Hence, the generation capacity of the original model is fully preserved. We compare our approach with alternative approaches across various datasets, evaluation metrics, and diffusion models. Experiment results show that our method consistently outperforms other baselines, yielding images that more faithfully reflect the desired concepts with reduced computation overhead. Code is available at https://github.com/YaNgZhAnG-V5/attention_regulation.

Title: A Zero Trust Framework for Realization and Defense Against Generative AI Attacks in Power Grid

Authors: Md. Shirajum Munir, Sravanthi Proddatoori, Manjushree Muralidhara, Walid Saad, Zhu Han, Sachin Shetty
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06388
Pdf URL: https://arxiv.org/pdf/2403.06388
Copy Paste: [[2403.06388]] A Zero Trust Framework for Realization and Defense Against Generative AI Attacks in Power Grid(https://arxiv.org/abs/2403.06388)
Keywords: generative
Abstract: Understanding the potential of generative AI (GenAI)-based attacks on the power grid is a fundamental challenge that must be addressed in order to protect the power grid by realizing and validating risk in new attack vectors. In this paper, a novel zero trust framework for a power grid supply chain (PGSC) is proposed. This framework facilitates early detection of potential GenAI-driven attack vectors (e.g., replay and protocol-type attacks), assessment of tail risk-based stability measures, and mitigation of such threats. First, a new zero trust system model of PGSC is designed and formulated as a zero-trust problem that seeks to guarantee for a stable PGSC by realizing and defending against GenAI-driven cyber attacks. Second, in which a domain-specific generative adversarial networks (GAN)-based attack generation mechanism is developed to create a new vulnerability cyberspace for further understanding that threat. Third, tail-based risk realization metrics are developed and implemented for quantifying the extreme risk of a potential attack while leveraging a trust measurement approach for continuous validation. Fourth, an ensemble learning-based bootstrap aggregation scheme is devised to detect the attacks that are generating synthetic identities with convincing user and distributed energy resources device profiles. Experimental results show the efficacy of the proposed zero trust framework that achieves an accuracy of 95.7% on attack vector generation, a risk measure of 9.61% for a 95% stable PGSC, and a 99% confidence in defense against GenAI-driven attack.

Title: FSViewFusion: Few-Shots View Generation of Novel Objects

Authors: Rukhshanda Hussain, Hui Xian Grace Lim, Borchun Chen, Mubarak Shah, Ser Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06394
Pdf URL: https://arxiv.org/pdf/2403.06394
Copy Paste: [[2403.06394]] FSViewFusion: Few-Shots View Generation of Novel Objects(https://arxiv.org/abs/2403.06394)
Keywords: diffusion
Abstract: Novel view synthesis has observed tremendous developments since the arrival of NeRFs. However, Nerf models overfit on a single scene, lacking generalization to out of distribution objects. Recently, diffusion models have exhibited remarkable performance on introducing generalization in view synthesis. Inspired by these advancements, we explore the capabilities of a pretrained stable diffusion model for view synthesis without explicit 3D priors. Specifically, we base our method on a personalized text to image model, Dreambooth, given its strong ability to adapt to specific novel objects with a few shots. Our research reveals two interesting findings. First, we observe that Dreambooth can learn the high level concept of a view, compared to arguably more complex strategies which involve finetuning diffusions on large amounts of multi-view data. Second, we establish that the concept of a view can be disentangled and transferred to a novel object irrespective of the original object's identify from which the views are learnt. Motivated by this, we introduce a learning strategy, FSViewFusion, which inherits a specific view through only one image sample of a single scene, and transfers the knowledge to a novel object, learnt from few shots, using low rank adapters. Through extensive experiments we demonstrate that our method, albeit simple, is efficient in generating reliable view samples for in the wild images. Code and models will be released.

Title: DivCon: Divide and Conquer for Progressive Text-to-Image Generation

Authors: Yuhao Jia, Wenhan Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06400
Pdf URL: https://arxiv.org/pdf/2403.06400
Copy Paste: [[2403.06400]] DivCon: Divide and Conquer for Progressive Text-to-Image Generation(https://arxiv.org/abs/2403.06400)
Keywords: diffusion
Abstract: Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermedium to bridge large language models and layout-based diffusion models. However, these methods still struggle with generating images from textural prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical \& spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks and our approach outperforms previous state-of-the-art models with notable margins. In addition, visual results demonstrate that our approach significantly improves the controllability and consistency in generating multiple objects from complex textural prompts.

Title: 'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification

Authors: Manish Chandra, Debasis Ganguly, Yiwen Li, Iadh Ounis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06402
Pdf URL: https://arxiv.org/pdf/2403.06402
Copy Paste: [[2403.06402]] 'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification(https://arxiv.org/abs/2403.06402)
Keywords: generative, in-context
Abstract: Predictive models in natural language processing (NLP) have evolved from training models from scratch to fine-tuning pre-trained models with labelled data. An extreme form of this fine-tuning involves in-context learning (ICL), where the output of a pre-trained generative model (frozen decoder parameters) is controlled only with variations in the input strings (called instructions or prompts). An important component of ICL is the use of a small number of labelled data instances as examples in the prompt. While existing work uses a static number of examples during inference for each data instance, in this paper we propose a novel methodology of dynamically adapting the number of examples as per the data. This is analogous to the use of a variable-sized neighborhood in k-nearest neighbors (k-NN) classifier. In our proposed workflow of adaptive ICL (AICL), the number of demonstrations to employ during the inference on a particular data instance is predicted by the Softmax posteriors of a classifier. The parameters of this classifier are fitted on the optimal number of examples in ICL required to correctly infer the label of each instance in the training set with the hypothesis that a test instance that is similar to a training instance should use the same (or a closely matching) number of few-shot examples. Our experiments show that our AICL method results in improvement in text classification task on several standard datasets.

Title: PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models

Authors: Qingdong He, Jinlong Peng, Zhengkai Jiang, Xiaobin Hu, Jiangning Zhang, Qiang Nie, Yabiao Wang, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06403
Pdf URL: https://arxiv.org/pdf/2403.06403
Copy Paste: [[2403.06403]] PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models(https://arxiv.org/abs/2403.06403)
Keywords: foundation model
Abstract: Recent success of vision foundation models have shown promising performance for the 2D perception tasks. However, it is difficult to train a 3D foundation network directly due to the limited dataset and it remains under explored whether existing foundation models can be lifted to 3D space seamlessly. In this paper, we present PointSeg, a novel training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks. PointSeg can segment anything in 3D scene by acquiring accurate 3D prompts to align their corresponding pixels across frames. Concretely, we design a two-branch prompts learning structure to construct the 3D point-box prompts pairs, combining with the bidirectional matching strategy for accurate point and proposal prompts generation. Then, we perform the iterative post-refinement adaptively when cooperated with different vision foundation models. Moreover, we design a affinity-aware merging algorithm to improve the final ensemble masks. PointSeg demonstrates impressive segmentation performance across various datasets, all without training. Specifically, our approach significantly surpasses the state-of-the-art specialist model by 13.4$\%$, 11.3$\%$, and 12$\%$ mAP on ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top of that, PointSeg can incorporate with various segmentation models and even surpasses the supervised methods.

Title: Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents

Authors: Weixia Zhang, Dingquan Li, Guangtao Zhai, Xiaokang Yang, Kede Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06406
Pdf URL: https://arxiv.org/pdf/2403.06406
Copy Paste: [[2403.06406]] Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents(https://arxiv.org/abs/2403.06406)
Keywords: diffusion
Abstract: Contemporary no-reference image quality assessment (NR-IQA) models can effectively quantify the perceived image quality, with high correlations between model predictions and human perceptual scores on fixed test sets. However, little progress has been made in comparing NR-IQA models from a perceptual optimization perspective. Here, for the first time, we demonstrate that NR-IQA models can be plugged into the maximum a posteriori (MAP) estimation framework for image enhancement. This is achieved by taking the gradients in differentiable and bijective diffusion latents rather than in the raw pixel domain. Different NR-IQA models are likely to induce different enhanced images, which are ultimately subject to psychophysical testing. This leads to a new computational method for comparing NR-IQA models within the analysis-by-synthesis framework. Compared to conventional correlation-based metrics, our method provides complementary insights into the relative strengths and weaknesses of the competing NR-IQA models in the context of perceptual optimization.

Title: A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos

Authors: Weixia Zhang, Chengguang Zhu, Jingnan Gao, Yichao Yan, Guangtao Zhai, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06421
Pdf URL: https://arxiv.org/pdf/2403.06421
Copy Paste: [[2403.06421]] A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos(https://arxiv.org/abs/2403.06421)
Keywords: generative
Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) technology has propelled audio-driven talking head generation, gaining considerable research attention for practical applications. However, performance evaluation research lags behind the development of talking head generation techniques. Existing literature relies on heuristic quantitative metrics without human validation, hindering accurate progress assessment. To address this gap, we collect talking head videos generated from four generative methods and conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness. Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures. We believe our work will facilitate performance evaluation and model development, providing insights into AIGC in a broader context. Code and data will be made available at https://github.com/zwx8981/ADTH-QA.

Title: Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain

Authors: Jungwon Choi, Hyungi Lee, Byung-Hoon Kim, Juho Lee
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2403.06432
Pdf URL: https://arxiv.org/pdf/2403.06432
Copy Paste: [[2403.06432]] Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain(https://arxiv.org/abs/2403.06432)
Keywords: self-supervised, generative
Abstract: Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks. However, obtaining extensive labeled clinical data for training is often resource-intensive, making practical application difficult. Leveraging unlabeled data thus becomes crucial for representation learning in a label-scarce setting. Although generative self-supervised learning techniques, especially masked autoencoders, have shown promising results in representation learning in various domains, their application to dynamic graphs for dynamic functional connectivity remains underexplored, facing challenges in capturing high-level semantic representations. Here, we introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision. ST-JEMA employs a JEPA-inspired strategy for reconstructing dynamic graphs, which enables the learning of higher-level semantic representations considering temporal perspectives, addressing the challenges in fMRI data representation learning. Utilizing the large-scale UK Biobank dataset for self-supervised learning, ST-JEMA shows exceptional representation learning performance on dynamic functional connectivity demonstrating superiority over previous methods in predicting phenotypes and psychiatric diagnoses across eight benchmark fMRI datasets even with limited samples and effectiveness of temporal reconstruction on missing data scenarios. These findings highlight the potential of our approach as a robust representation learning method for leveraging label-scarce fMRI data.

Title: Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation

Authors: Guangyang Wu, Xiaohong Liu, Jun Jia, Xuehao Cui, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06452
Pdf URL: https://arxiv.org/pdf/2403.06452
Copy Paste: [[2403.06452]] Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation(https://arxiv.org/abs/2403.06452)
Keywords: diffusion
Abstract: In the digital era, QR codes serve as a linchpin connecting virtual and physical realms. Their pervasive integration across various applications highlights the demand for aesthetically pleasing codes without compromised scannability. However, prevailing methods grapple with the intrinsic challenge of balancing customization and scannability. Notably, stable-diffusion models have ushered in an epoch of high-quality, customizable content generation. This paper introduces Text2QR, a pioneering approach leveraging these advancements to address a fundamental challenge: concurrently achieving user-defined aesthetics and scanning robustness. To ensure stable generation of aesthetic QR codes, we introduce the QR Aesthetic Blueprint (QAB) module, generating a blueprint image exerting control over the entire generation process. Subsequently, the Scannability Enhancing Latent Refinement (SELR) process refines the output iteratively in the latent space, enhancing scanning robustness. This approach harnesses the potent generation capabilities of stable-diffusion models, navigating the trade-off between image aesthetics and QR code scannability. Our experiments demonstrate the seamless fusion of visual appeal with the practical utility of aesthetic QR codes, markedly outperforming prior methods. Codes are available at \url{https://github.com/mulns/Text2QR}

Title: 3D-aware Image Generation and Editing with Multi-modal Conditions

Authors: Bo Li, Yi-ke Li, Zhi-fen He, Bin Liu, Yun-Kun Lai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06470
Pdf URL: https://arxiv.org/pdf/2403.06470
Copy Paste: [[2403.06470]] 3D-aware Image Generation and Editing with Multi-modal Conditions(https://arxiv.org/abs/2403.06470)
Keywords: generative
Abstract: 3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision. Although some related works have made great progress in this field, most of the existing methods suffer from poor disentanglement performance of shape and appearance, and lack multi-modal control. In this paper, we propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs, including pure noise, text and reference image. On the one hand, we dive into the latent space of 3D Generative Adversarial Networks (GANs) and propose a novel disentanglement strategy to separate appearance features from shape features during the generation process. On the other hand, we propose a unified framework for flexible image generation and editing tasks with multi-modal conditions. Our method can generate diverse images with distinct noises, edit the attribute through a text description and conduct style transfer by giving a reference RGB image. Extensive experiments demonstrate that the proposed method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.

Title: Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts

Authors: Jiawen Zhu, Guansong Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06495
Pdf URL: https://arxiv.org/pdf/2403.06495
Copy Paste: [[2403.06495]] Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts(https://arxiv.org/abs/2403.06495)
Keywords: anomaly, in-context
Abstract: This paper explores the problem of Generalist Anomaly Detection (GAD), aiming to train one single detection model that can generalize to detect anomalies in diverse datasets from different application domains without any further training on the target data. Some recent studies have shown that large pre-trained Visual-Language Models (VLMs) like CLIP have strong generalization capabilities on detecting industrial defects from various datasets, but their methods rely heavily on handcrafted text prompts about defects, making them difficult to generalize to anomalies in other applications, e.g., medical image anomalies or semantic anomalies in natural images. In this work, we propose to train a GAD model with few-shot normal images as sample prompts for AD on diverse datasets on the fly. To this end, we introduce a novel approach that learns an in-context residual learning model for GAD, termed InCTRL. It is trained on an auxiliary dataset to discriminate anomalies from normal samples based on a holistic evaluation of the residuals between query images and few-shot normal sample prompts. Regardless of the datasets, per definition of anomaly, larger residuals are expected for anomalies than normal samples, thereby enabling InCTRL to generalize across different domains without further training.

Title: Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning

Authors: Woojung Han, Chanyoung Kim, Dayun Ju, Yumin Shim, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06516
Pdf URL: https://arxiv.org/pdf/2403.06516
Copy Paste: [[2403.06516]] Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning(https://arxiv.org/abs/2403.06516)
Keywords: diffusion
Abstract: Recent advances in text-conditioned image generation diffusion models have begun paving the way for new opportunities in modern medical domain, in particular, generating Chest X-rays (CXRs) from diagnostic reports. Nonetheless, to further drive the diffusion models to generate CXRs that faithfully reflect the complexity and diversity of real data, it has become evident that a nontrivial learning approach is needed. In light of this, we propose CXRL, a framework motivated by the potential of reinforcement learning (RL). Specifically, we integrate a policy gradient RL approach with well-designed multiple distinctive CXR-domain specific reward models. This approach guides the diffusion denoising trajectory, achieving precise CXR posture and pathological details. Here, considering the complex medical image environment, we present "RL with Comparative Feedback" (RLCF) for the reward mechanism, a human-like comparative evaluation that is known to be more effective and reliable in complex scenarios compared to direct evaluation. Our CXRL framework includes jointly optimizing learnable adaptive condition embeddings (ACE) and the image generator, enabling the model to produce more accurate and higher perceptual CXR quality. Our extensive evaluation of the MIMIC-CXR-JPG dataset demonstrates the effectiveness of our RL-based tuning approach. Consequently, our CXRL generates pathologically realistic CXRs, establishing a new standard for generating CXRs with high fidelity to real-world clinical scenarios.

Title: Active Generation for Image Classification

Authors: Tao Huang, Jiaqi Liu, Shan You, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06517
Pdf URL: https://arxiv.org/pdf/2403.06517
Copy Paste: [[2403.06517]] Active Generation for Image Classification(https://arxiv.org/abs/2403.06517)
Keywords: diffusion, generative
Abstract: Recently, the growing capabilities of deep generative models have underscored their potential in enhancing image classification accuracy. However, existing methods often demand the generation of a disproportionately large number of images compared to the original dataset, while having only marginal improvements in accuracy. This computationally expensive and time-consuming process hampers the practicality of such approaches. In this paper, we propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model. With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation. It aims to create images akin to the challenging or misclassified samples encountered by the current model and incorporates these generated images into the training set to augment model performance. ActGen introduces an attentive image guidance technique, using real images as guides during the denoising process of a diffusion model. The model's attention on class prompt is leveraged to ensure the preservation of similar foreground object while diversifying the background. Furthermore, we introduce a gradient-based generation guidance method, which employs two losses to generate more challenging samples and prevent the generated images from being too similar to previously generated ones. Experimental results on the CIFAR and ImageNet datasets demonstrate that our method achieves better performance with a significantly reduced number of generated images.

Title: OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation

Authors: Baran Ozaydin, Tong Zhang, Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06546
Pdf URL: https://arxiv.org/pdf/2403.06546
Copy Paste: [[2403.06546]] OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation(https://arxiv.org/abs/2403.06546)
Keywords: self-supervised
Abstract: Unsupervised Semantic Segmentation (USS) involves segmenting images without relying on predefined labels, aiming to alleviate the burden of extensive human labeling. Existing methods utilize features generated by self-supervised models and specific priors for clustering. However, their clustering objectives are not involved in the optimization of the features during training. Additionally, due to the lack of clear class definitions in USS, the resulting segments may not align well with the clustering objective. In this paper, we introduce a novel approach called Optimally Matched Hierarchy (OMH) to simultaneously address the above issues. The core of our method lies in imposing structured sparsity on the feature space, which allows the features to encode information with different levels of granularity. The structure of this sparsity stems from our hierarchy (OMH). To achieve this, we learn a soft but sparse hierarchy among parallel clusters through Optimal Transport. Our OMH yields better unsupervised segmentation performance compared to existing USS methods. Our extensive experiments demonstrate the benefits of OMH when utilizing our differentiable paradigm. We will make our code publicly available.

Title: Detection of Object Throwing Behavior in Surveillance Videos

Authors: Ivo P.C. Kersten, Erkut Akdag, Egor Bondarev, Peter H. N. De With
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06552
Pdf URL: https://arxiv.org/pdf/2403.06552
Copy Paste: [[2403.06552]] Detection of Object Throwing Behavior in Surveillance Videos(https://arxiv.org/abs/2403.06552)
Keywords: anomaly
Abstract: Anomalous behavior detection is a challenging research area within computer vision. Progress in this area enables automated detection of dangerous behavior using surveillance camera feeds. A dangerous behavior that is often overlooked in other research is the throwing action in traffic flow, which is one of the unique requirements of our Smart City project to enhance public safety. This paper proposes a solution for throwing action detection in surveillance videos using deep learning. At present, datasets for throwing actions are not publicly available. To address the use-case of our Smart City project, we first generate the novel public 'Throwing Action' dataset, consisting of 271 videos of throwing actions performed by traffic participants, such as pedestrians, bicyclists, and car drivers, and 130 normal videos without throwing actions. Second, we compare the performance of different feature extractors for our anomaly detection method on the UCF-Crime and Throwing-Action datasets. The explored feature extractors are the Convolutional 3D (C3D) network, the Inflated 3D ConvNet (I3D) network, and the Multi-Fiber Network (MFNet). Finally, the performance of the anomaly detection algorithm is improved by applying the Adam optimizer instead of Adadelta, and proposing a mean normal loss function that covers the multitude of normal situations in traffic. Both aspects yield better anomaly detection performance. Besides this, the proposed mean normal loss function lowers the false alarm rate on the combined dataset. The experimental results reach an area under the ROC curve of 86.10 for the Throwing-Action dataset, and 80.13 on the combined dataset, respectively.

Title: Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology

Authors: Stefan Denner, David Zimmerer, Dimitrios Bounias, Markus Bujotzek, Shuhan Xiao, Lisa Kausch, Philipp Schader, Tobias Penzkofer, Paul F. Jäger, Klaus Maier-Hein
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2403.06567
Pdf URL: https://arxiv.org/pdf/2403.06567
Copy Paste: [[2403.06567]] Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology(https://arxiv.org/abs/2403.06567)
Keywords: foundation model
Abstract: Content-based image retrieval (CBIR) has the potential to significantly improve diagnostic aid and medical research in radiology. Current CBIR systems face limitations due to their specialization to certain pathologies, limiting their utility. In response, we propose using vision foundation models as powerful and versatile off-the-shelf feature extractors for content-based medical image retrieval. By benchmarking these models on a comprehensive dataset of 1.6 million 2D radiological images spanning four modalities and 161 pathologies, we identify weakly-supervised models as superior, achieving a P@1 of up to 0.594. This performance not only competes with a specialized model but does so without the need for fine-tuning. Our analysis further explores the challenges in retrieving pathological versus anatomical structures, indicating that accurate retrieval of pathological features presents greater difficulty. Despite these challenges, our research underscores the vast potential of foundation models for CBIR in radiology, proposing a shift towards versatile, general-purpose medical image retrieval systems that do not require specific tuning.

Title: FFAD: A Novel Metric for Assessing Generated Time Series Data Utilizing Fourier Transform and Auto-encoder

Authors: Yang Chen, Dustin J. Kempton, Rafal A. Angryk
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.06576
Pdf URL: https://arxiv.org/pdf/2403.06576
Copy Paste: [[2403.06576]] FFAD: A Novel Metric for Assessing Generated Time Series Data Utilizing Fourier Transform and Auto-encoder(https://arxiv.org/abs/2403.06576)
Keywords: generative
Abstract: The success of deep learning-based generative models in producing realistic images, videos, and audios has led to a crucial consideration: how to effectively assess the quality of synthetic samples. While the Fr\'{e}chet Inception Distance (FID) serves as the standard metric for evaluating generative models in image synthesis, a comparable metric for time series data is notably absent. This gap in assessment capabilities stems from the absence of a widely accepted feature vector extractor pre-trained on benchmark time series datasets. In addressing these challenges related to assessing the quality of time series, particularly in the context of Fr\'echet Distance, this work proposes a novel solution leveraging the Fourier transform and Auto-encoder, termed the Fr\'{e}chet Fourier-transform Auto-encoder Distance (FFAD). Through our experimental results, we showcase the potential of FFAD for effectively distinguishing samples from different classes. This novel metric emerges as a fundamental tool for the evaluation of generative time series data, contributing to the ongoing efforts of enhancing assessment methodologies in the realm of deep learning-based generative models.

Title: Distributionally Generative Augmentation for Fair Facial Attribute Classification

Authors: Fengda Zhang, Qianpei He, Kun Kuang, Jiashuo Liu, Long Chen, Chao Wu, Jun Xiao, Hanwang Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06606
Pdf URL: https://arxiv.org/pdf/2403.06606
Copy Paste: [[2403.06606]] Distributionally Generative Augmentation for Fair Facial Attribute Classification(https://arxiv.org/abs/2403.06606)
Keywords: generative
Abstract: Facial Attribute Classification (FAC) holds substantial promise in widespread applications. However, FAC models trained by traditional methodologies can be unfair by exhibiting accuracy inconsistencies across varied data subpopulations. This unfairness is largely attributed to bias in data, where some spurious attributes (e.g., Male) statistically correlate with the target attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the labels of spurious attributes, which may be unavailable in practice. This work proposes a novel, generation-based two-stage framework to train a fair FAC model on biased data without additional annotation. Initially, we identify the potential spurious attributes based on generative models. Notably, it enhances interpretability by explicitly showing the spurious attributes in image space. Following this, for each image, we first edit the spurious attributes with a random degree sampled from a uniform distribution, while keeping target attribute unchanged. Then we train a fair FAC model by fostering model invariance to these augmentation. Extensive experiments on three common datasets demonstrate the effectiveness of our method in promoting fairness in FAC without compromising accuracy. Codes are in https://github.com/heqianpei/DiGA.

Title: Guiding Clinical Reasoning with Large Language Models via Knowledge Seeds

Authors: Jiageng WU, Xian Wu, Jie Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06609
Pdf URL: https://arxiv.org/pdf/2403.06609
Copy Paste: [[2403.06609]] Guiding Clinical Reasoning with Large Language Models via Knowledge Seeds(https://arxiv.org/abs/2403.06609)
Keywords: in-context
Abstract: Clinical reasoning refers to the cognitive process that physicians employ in evaluating and managing patients. This process typically involves suggesting necessary examinations, diagnosing patients' diseases, and deciding on appropriate therapies, etc. Accurate clinical reasoning requires extensive medical knowledge and rich clinical experience, setting a high bar for physicians. This is particularly challenging in developing countries due to the overwhelming number of patients and limited physician resources, contributing significantly to global health inequity and necessitating automated clinical reasoning approaches. Recently, the emergence of large language models (LLMs) such as ChatGPT and GPT-4 have demonstrated their potential in clinical reasoning. However, these LLMs are prone to hallucination problems, and the reasoning process of LLMs may not align with the clinical decision path of physicians. In this study, we introduce a novel framework, In-Context Padding (ICP), designed to enhance LLMs with medical knowledge. Specifically, we infer critical clinical reasoning elements (referred to as knowledge seeds) and use these as anchors to guide the generation process of LLMs. Experiments on two clinical question datasets demonstrate that ICP significantly improves the clinical reasoning ability of LLMs.

Title: Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework

Authors: Henrique Jesus, Hugo Proença
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06658
Pdf URL: https://arxiv.org/pdf/2403.06658
Copy Paste: [[2403.06658]] Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework(https://arxiv.org/abs/2403.06658)
Keywords: generative
Abstract: Large vision models based in deep learning architectures have been consistently advancing the state-of-the-art in biometric recognition. However, three weaknesses are commonly reported for such kind of approaches: 1) their extreme demands in terms of learning data; 2) the difficulties in generalising between different domains; and 3) the lack of interpretability/explainability, with biometrics being of particular interest, as it is important to provide evidence able to be used for forensics/legal purposes (e.g., in courts). To the best of our knowledge, this paper describes the first recognition framework/strategy that aims at addressing the three weaknesses simultaneously. At first, it relies exclusively in synthetic samples for learning purposes. Instead of requiring a large amount and variety of samples for each subject, the idea is to exclusively enroll a 3D point cloud per identity. Then, using generative strategies, we synthesize a very large (potentially infinite) number of samples, containing all the desired covariates (poses, clothing, distances, perspectives, lighting, occlusions,...). Upon the synthesizing method used, it is possible to adapt precisely to different kind of domains, which accounts for generalization purposes. Such data are then used to learn a model that performs local registration between image pairs, establishing positive correspondences between body parts that are the key, not only to recognition (according to cardinality and distribution), but also to provide an interpretable description of the response (e.g.: "both samples are from the same person, as they have similar facial shape, hair color and legs thickness").

Title: Car Damage Detection and Patch-to-Patch Self-supervised Image Alignment

Authors: Hanxiao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06674
Pdf URL: https://arxiv.org/pdf/2403.06674
Copy Paste: [[2403.06674]] Car Damage Detection and Patch-to-Patch Self-supervised Image Alignment(https://arxiv.org/abs/2403.06674)
Keywords: self-supervised
Abstract: Most computer vision applications aim to identify pixels in a scene and use them for diverse purposes. One intriguing application is car damage detection for insurance carriers which tends to detect all car damages by comparing both pre-trip and post-trip images, even requiring two components: (i) car damage detection; (ii) image alignment. Firstly, we implemented a Mask R-CNN model to detect car damages on custom images. Whereas for the image alignment section, we especially propose a novel self-supervised Patch-to-Patch SimCLR inspired alignment approach to find perspective transformations between custom pre/post car rental images except for traditional computer vision methods.

Title: Trustworthy Partial Label Learning with Out-of-distribution Detection

Authors: Jintao Huang, Yiu-Ming Cheung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06681
Pdf URL: https://arxiv.org/pdf/2403.06681
Copy Paste: [[2403.06681]] Trustworthy Partial Label Learning with Out-of-distribution Detection(https://arxiv.org/abs/2403.06681)
Keywords: self-supervised
Abstract: Partial Label Learning (PLL) grapples with learning from ambiguously labelled data, and it has been successfully applied in fields such as image recognition. Nevertheless, traditional PLL methods rely on the closed-world assumption, which can be limiting in open-world scenarios and negatively impact model performance and generalization. To tackle these challenges, our study introduces a novel method called PLL-OOD, which is the first to incorporate Out-of-Distribution (OOD) detection into the PLL framework. PLL-OOD significantly enhances model adaptability and accuracy by merging self-supervised learning with partial label loss and pioneering the Partial-Energy (PE) score for OOD detection. This approach improves data feature representation and effectively disambiguates candidate labels, using a dynamic label confidence matrix to refine predictions. The PE score, adjusted by label confidence, precisely identifies OOD instances, optimizing model training towards in-distribution data. This innovative method markedly boosts PLL model robustness and performance in open-world settings. To validate our approach, we conducted a comprehensive comparative experiment combining the existing state-of-the-art PLL model with multiple OOD scores on the CIFAR-10 and CIFAR-100 datasets with various OOD datasets. The results demonstrate that the proposed PLL-OOD framework is highly effective and effectiveness outperforms existing models, showcasing its superiority and effectiveness.

Title: PCLD: Point Cloud Layerwise Diffusion for Adversarial Purification

Authors: Mert Gulsen, Batuhan Cengiz, Yusuf H. Sahin, Gozde Unal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06698
Pdf URL: https://arxiv.org/pdf/2403.06698
Copy Paste: [[2403.06698]] PCLD: Point Cloud Layerwise Diffusion for Adversarial Purification(https://arxiv.org/abs/2403.06698)
Keywords: diffusion
Abstract: Point clouds are extensively employed in a variety of real-world applications such as robotics, autonomous driving and augmented reality. Despite the recent success of point cloud neural networks, especially for safety-critical tasks, it is essential to also ensure the robustness of the model. A typical way to assess a model's robustness is through adversarial attacks, where test-time examples are generated based on gradients to deceive the model. While many different defense mechanisms are studied in 2D, studies on 3D point clouds have been relatively limited in the academic field. Inspired from PointDP, which denoises the network inputs by diffusion, we propose Point Cloud Layerwise Diffusion (PCLD), a layerwise diffusion based 3D point cloud defense strategy. Unlike PointDP, we propagated the diffusion denoising after each layer to incrementally enhance the results. We apply our defense method to different types of commonly used point cloud models and adversarial attacks to evaluate its robustness. Our experiments demonstrate that the proposed defense method achieved results that are comparable to or surpass those of existing methodologies, establishing robustness through a novel technique. Code is available at https://github.com/batuceng/diffusion-layer-robustness-pc.

Title: Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback

Authors: Adarsh N L, Arun P V, Aravindh N L
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06735
Pdf URL: https://arxiv.org/pdf/2403.06735
Copy Paste: [[2403.06735]] Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback(https://arxiv.org/abs/2403.06735)
Keywords: generative
Abstract: Research on generative models to produce human-aligned / human-preferred outputs has seen significant recent contributions. Between text and image-generative models, we narrowed our focus to text-based generative models, particularly to produce captions for images that align with human preferences. In this research, we explored a potential method to amplify the performance of the Deep Neural Network Model to generate captions that are preferred by humans. This was achieved by integrating Supervised Learning and Reinforcement Learning with Human Feedback (RLHF) using the Flickr8k dataset. Also, a novel loss function that is capable of optimizing the model based on human feedback is introduced. In this paper, we provide a concise sketch of our approach and results, hoping to contribute to the ongoing advances in the field of human-aligned generative AI models.

Title: V3D: Video Diffusion Models are Effective 3D Generators

Authors: Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, Huaping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06738
Pdf URL: https://arxiv.org/pdf/2403.06738
Copy Paste: [[2403.06738]] V3D: Video Diffusion Models are Effective 3D Generators(https://arxiv.org/abs/2403.06738)
Keywords: diffusion
Abstract: Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency. Our code is available at https://github.com/heheyas/V3D

Title: Distribution-Aware Data Expansion with Diffusion Models

Authors: Haowei Zhu, Ling Yang, Jun-Hai Yong, Wentao Zhang, Bin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06741
Pdf URL: https://arxiv.org/pdf/2403.06741
Copy Paste: [[2403.06741]] Distribution-Aware Data Expansion with Diffusion Models(https://arxiv.org/abs/2403.06741)
Keywords: diffusion
Abstract: The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion methods encompass image transformation-based and synthesis-based methods. The transformation-based methods introduce only local variations, resulting in poor diversity. While image synthesis-based methods can create entirely new content, significantly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, an effective data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its ability to generate distribution-consistent samples, achieving substantial improvements in data expansion tasks. Specifically, without additional training, DistDiff achieves a 30.7% improvement in accuracy across six image datasets compared to the model trained on original datasets and a 9.8% improvement compared to the state-of-the-art diffusion-based method. Our code is available at https://github.com/haoweiz23/DistDiff

Title: Boosting Image Restoration via Priors from Pre-trained Models

Authors: Xiaogang Xu, Shu Kong, Tao Hu, Zhe Liu, Hujun Bao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06793
Pdf URL: https://arxiv.org/pdf/2403.06793
Copy Paste: [[2403.06793]] Boosting Image Restoration via Priors from Pre-trained Models(https://arxiv.org/abs/2403.06793)
Keywords: diffusion
Abstract: Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size ($<$1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.

Title: Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake Detection

Authors: Chuangchuang Tan, Ping Liu, RenShuai Tao, Huan Liu, Yao Zhao, Baoyuan Wu, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06803
Pdf URL: https://arxiv.org/pdf/2403.06803
Copy Paste: [[2403.06803]] Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake Detection(https://arxiv.org/abs/2403.06803)
Keywords: generative
Abstract: Recently, the proliferation of increasingly realistic synthetic images generated by various generative adversarial networks has increased the risk of misuse. Consequently, there is a pressing need to develop a generalizable detector for accurately recognizing fake images. The conventional methods rely on generating diverse training sources or large pretrained models. In this work, we show that, on the contrary, the small and training-free filter is sufficient to capture more general artifact representations. Due to its unbias towards both the training and test sources, we define it as Data-Independent Operator (DIO) to achieve appealing improvements on unseen sources. In our framework, handcrafted filters and the randomly-initialized convolutional layer can be used as the training-free artifact representations extractor with excellent results. With the data-independent operator of a popular classifier, such as Resnet50, one could already reach a new state-of-the-art without bells and whistles. We evaluate the effectiveness of the DIO on 33 generation models, even DALLE and Midjourney. Our detector achieves a remarkable improvement of $13.3\%$, establishing a new state-of-the-art performance. The DIO and its extension can serve as strong baselines for future methods. The code is available at \url{https://github.com/chuangchuangtan/Data-Independent-Operator}.

Title: Multistep Consistency Models

Authors: Jonathan Heek, Emiel Hoogeboom, Tim Salimans
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2403.06807
Pdf URL: https://arxiv.org/pdf/2403.06807
Copy Paste: [[2403.06807]] Multistep Consistency Models(https://arxiv.org/abs/2403.06807)
Keywords: diffusion
Abstract: Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: A unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas we show that a $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can train models more easily that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on Imagenet 64 in 8 step and 2.1 FID on Imagenet128 in 8 steps with consistency distillation. We also show that our method scales to a text-to-image diffusion model, generating samples that are very close to the quality of the original model.

Title: In-context Exploration-Exploitation for Reinforcement Learning

Authors: Zhenwen Dai, Federico Tomasi, Sina Ghiassian
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2403.06826
Pdf URL: https://arxiv.org/pdf/2403.06826
Copy Paste: [[2403.06826]] In-context Exploration-Exploitation for Reinforcement Learning(https://arxiv.org/abs/2403.06826)
Keywords: in-context
Abstract: In-context learning is a promising approach for online policy learning of offline reinforcement learning (RL) methods, which can be achieved at inference time without gradient optimization. However, this method is hindered by significant computational costs resulting from the gathering of large training trajectory sets and the need to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process biased methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method.

Title: Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting

Authors: Wenting Chen, Pengyu Wang, Hui Ren, Lichao Sun, Quanzheng Li, Yixuan Yuan, Xiang Li
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.06835
Pdf URL: https://arxiv.org/pdf/2403.06835
Copy Paste: [[2403.06835]] Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting(https://arxiv.org/abs/2403.06835)
Keywords: generative
Abstract: Data scarcity and privacy concerns limit the availability of high-quality medical images for public use, which can be mitigated through medical image synthesis. However, current medical image synthesis methods often struggle to accurately capture the complexity of detailed anatomical structures and pathological conditions. To address these challenges, we propose a novel medical image synthesis model that leverages fine-grained image-text alignment and anatomy-pathology prompts to generate highly detailed and accurate synthetic medical images. Our method integrates advanced natural language processing techniques with image generative modeling, enabling precise alignment between descriptive text prompts and the synthesized images' anatomical and pathological details. The proposed approach consists of two key components: an anatomy-pathology prompting module and a fine-grained alignment-based synthesis module. The anatomy-pathology prompting module automatically generates descriptive prompts for high-quality medical images. To further synthesize high-quality medical images from the generated prompts, the fine-grained alignment-based synthesis module pre-defines a visual codebook for the radiology dataset and performs fine-grained alignment between the codebook and generated prompts to obtain key patches as visual clues, facilitating accurate image synthesis. We validate the superiority of our method through experiments on public chest X-ray datasets and demonstrate that our synthetic images preserve accurate semantic information, making them valuable for various medical applications.

Title: Stochastic Cortical Self-Reconstruction

Authors: Christian Wachinger, Dennis Hedderich, Fabian Bongratz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06837
Pdf URL: https://arxiv.org/pdf/2403.06837
Copy Paste: [[2403.06837]] Stochastic Cortical Self-Reconstruction(https://arxiv.org/abs/2403.06837)
Keywords: generative
Abstract: Magnetic resonance imaging (MRI) is critical for diagnosing neurodegenerative diseases, yet accurately assessing mild cortical atrophy remains a challenge due to its subtlety. Automated cortex reconstruction, paired with healthy reference ranges, aids in pinpointing pathological atrophy, yet their generalization is limited by biases from image acquisition and processing. We introduce the concept of stochastic cortical self-reconstruction (SCSR) that creates a subject-specific healthy reference by taking MRI-derived thicknesses as input and, therefore, implicitly accounting for potential confounders. SCSR randomly corrupts parts of the cortex and self-reconstructs them from the remaining information. Trained exclusively on healthy individuals, repeated self-reconstruction generates a stochastic reference cortex for assessing deviations from the norm. We present three implementations of this concept: XGBoost applied on parcels, and two autoencoders on vertex level -- one based on a multilayer perceptron and the other using a spherical U-Net. These models were trained on healthy subjects from the UK Biobank and subsequently evaluated across four public Alzheimer's datasets. Finally, we deploy the model on clinical in-house data, where deviation maps' high spatial resolution aids in discriminating between four types of dementia.

Title: QUASAR: QUality and Aesthetics Scoring with Advanced Representations

Authors: Sergey Kastryulin (1 and 3), Denis Prokopenko (2), Artem Babenko (3), Dmitry V. Dylov (1 and 4) ((1) Skolkovo Institute of Science and Technology, (2) King's Colledge London, (3) Yandex, (4) AIRI)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06866
Pdf URL: https://arxiv.org/pdf/2403.06866
Copy Paste: [[2403.06866]] QUASAR: QUality and Aesthetics Scoring with Advanced Representations(https://arxiv.org/abs/2403.06866)
Keywords: self-supervised
Abstract: This paper introduces a new data-driven, non-parametric method for image quality and aesthetics assessment, surpassing existing approaches and requiring no prompt engineering or fine-tuning. We eliminate the need for expressive textual embeddings by proposing efficient image anchors in the data. Through extensive evaluations of 7 state-of-the-art self-supervised models, our method demonstrates superior performance and robustness across various datasets and benchmarks. Notably, it achieves high agreement with human assessments even with limited data and shows high robustness to the nature of data and their pre-processing pipeline. Our contributions offer a streamlined solution for assessment of images while providing insights into the perception of visual information.

Title: Learning with Noisy Foundation Models

Authors: Hao Chen, Jindong Wang, Zihan Wang, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.06869
Pdf URL: https://arxiv.org/pdf/2403.06869
Copy Paste: [[2403.06869]] Learning with Noisy Foundation Models(https://arxiv.org/abs/2403.06869)
Keywords: self-supervised, foundation model
Abstract: Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization, which is applicable in both parameter-efficient and black-box tuning manners. We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term as Noisy Model Learning.

Title: COOD: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification

Authors: L. E. Hogeweg, R. Gangireddy, D. Brunink, V. J. Kalkman, L. Cornelissen, J.W. Kamminga
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06874
Pdf URL: https://arxiv.org/pdf/2403.06874
Copy Paste: [[2403.06874]] COOD: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification(https://arxiv.org/abs/2403.06874)
Keywords: anomaly
Abstract: High-performing out-of-distribution (OOD) detection, both anomaly and novel class, is an important prerequisite for the practical use of classification models. In this paper, we focus on the species recognition task in images concerned with large databases, a large number of fine-grained hierarchical classes, severe class imbalance, and varying image quality. We propose a framework for combining individual OOD measures into one combined OOD (COOD) measure using a supervised model. The individual measures are several existing state-of-the-art measures and several novel OOD measures developed with novel class detection and hierarchical class structure in mind. COOD was extensively evaluated on three large-scale (500k+ images) biodiversity datasets in the context of anomaly and novel class detection. We show that COOD outperforms individual, including state-of-the-art, OOD measures by a large margin in terms of TPR@1% FPR in the majority of experiments, e.g., improving detecting ImageNet images (OOD) from 54.3% to 85.4% for the iNaturalist 2018 dataset. SHAP (feature contribution) analysis shows that different individual OOD measures are essential for various tasks, indicating that multiple OOD measures and combinations are needed to generalize. Additionally, we show that explicitly considering ID images that are incorrectly classified for the original (species) recognition task is important for constructing high-performing OOD detection methods and for practical applicability. The framework can easily be extended or adapted to other tasks and media modalities.

Title: MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning

Authors: Yichuan Li, Xiyao Ma, Sixing Lu, Kyumin Lee, Xiaohu Liu, Chenlei Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.06914
Pdf URL: https://arxiv.org/pdf/2403.06914
Copy Paste: [[2403.06914]] MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning(https://arxiv.org/abs/2403.06914)
Keywords: in-context
Abstract: Large Language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, where a LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). Nevertheless, the inclusion of demonstrations leads to a quadratic increase in the computational overhead of the self-attention mechanism. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit the knowledge distillation to enhance alignment between MEND and LLM, achieving both efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process, which includes meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions using decoder-only (GPT-2) and encoder-decoder (T5) attest to MEND's prowess. It not only matches but often outperforms the Vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing the computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models

Title: DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

Authors: Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06951
Pdf URL: https://arxiv.org/pdf/2403.06951
Copy Paste: [[2403.06951]] DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations(https://arxiv.org/abs/2403.06951)
Keywords: diffusion
Abstract: The diffusion-based text-to-image model harbors immense potential in transferring reference style. However, current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper, we introduce \textit{DEADiff} to address this issue using the following two strategies: 1) a mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers which are instructed by different text descriptions. Then they are injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained using paired images rather than the identical target, in which the reference image and the ground-truth image are with the same style or semantics. We show that DEADiff attains the best visual stylization results and optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image, as demonstrated both quantitatively and qualitatively. Our project page is~\href{https://tianhao-qi.github.io/DEADiff/}{https://tianhao-qi.github.io/DEADiff/}.

Title: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06952
Pdf URL: https://arxiv.org/pdf/2403.06952
Copy Paste: [[2403.06952]] SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data(https://arxiv.org/abs/2403.06952)
Keywords: diffusion, in-context
Abstract: Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, such as incorrect spatial relationship or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging. First, SELMA leverages an LLM's in-context learning capability to generate multiple datasets of text prompts that can teach different skills, and then generates the images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to the new skills by learning multiple single-skill LoRA (low-rank adaptation) experts followed by expert merging. Our independent expert fine-tuning specializes multiple models for different skills, and expert merging helps build a joint multi-skill T2I model that can generate faithful images given diverse text prompts, while mitigating the knowledge conflict from different datasets. We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation. Moreover, fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data. Lastly, we show that fine-tuning with images from a weaker T2I model can help improve the generation quality of a stronger T2I model, suggesting promising weak-to-strong generalization in T2I models.

Title: Bayesian Diffusion Models for 3D Shape Reconstruction

Authors: Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, Zhuowen Tu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.06973
Pdf URL: https://arxiv.org/pdf/2403.06973
Copy Paste: [[2403.06973]] Bayesian Diffusion Models for 3D Shape Reconstruction(https://arxiv.org/abs/2403.06973)
Keywords: diffusion
Abstract: We present Bayesian Diffusion Models (BDM), a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We show the effectiveness of BDM on the 3D shape reconstruction task. Compared to prototypical deep learning data-driven approaches trained on paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM brings in rich prior information from standalone labels (e.g. point clouds) to improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian frameworks where explicit prior and likelihood are required for the inference, BDM performs seamless information fusion via coupled diffusion processes with learned gradient computation networks. The specialty of our BDM lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction.

Title: BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

Authors: Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.06976
Pdf URL: https://arxiv.org/pdf/2403.06976
Copy Paste: [[2403.06976]] BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion(https://arxiv.org/abs/2403.06976)
Keywords: diffusion
Abstract: Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these challenges, our work introduces a novel paradigm: the division of masked image features and noisy latent into separate branches. This division dramatically diminishes the model's learning load, facilitating a nuanced incorporation of essential masked image information in a hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, guaranteeing coherent and enhanced image inpainting outcomes. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.