2024-03-19

Title: VISREAS: Complex Visual Reasoning with Unanswerable Questions

Authors: Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10534
Pdf URL: https://arxiv.org/pdf/2403.10534
Copy Paste: [[2403.10534]] VISREAS: Complex Visual Reasoning with Unanswerable Questions(https://arxiv.org/abs/2403.10534)
Keywords: generative
Abstract: Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should address the discrepancies in the query and convey them to the users rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, that consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION that reasons by producing and executing pseudocode without any external modules to generate the answer. LOGIC2VISION outperforms generative models in VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in performance against the classification models.

Title: Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows

Authors: Zhangxuan Dang, Yu Zheng, Xinglin Lin, Chunlei Peng, Qiuyu Chen, Xinbo Gao
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10550
Pdf URL: https://arxiv.org/pdf/2403.10550
Copy Paste: [[2403.10550]] Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows(https://arxiv.org/abs/2403.10550)
Keywords: anomaly
Abstract: With the rapid development of the Internet, various types of anomaly traffic are threatening network security. We consider the problem of anomaly network traffic detection and propose a three-stage anomaly detection framework using only normal traffic. Our framework can generate pseudo anomaly samples without prior knowledge of anomalies to achieve the detection of anomaly data. Firstly, we employ a reconstruction method to learn the deep representation of normal samples. Secondly, these representations are normalized to a standard normal distribution using a bidirectional flow module. To simulate anomaly samples, we add noise to the normalized representations, which are then passed through the generation direction of the bidirectional flow module. Finally, a simple classifier is trained to differentiate the normal samples and pseudo anomaly samples in the latent space. During inference, our framework requires only two modules to detect anomalous samples, leading to a considerable reduction in model size. According to the experiments, our method achieves state-of-the-art results on the common benchmarking datasets of anomaly network traffic detection. The code is available at https://github.com/ZxuanDang/ATD-via-Flows.git
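
Illustrative note (not from the paper): the pseudo-anomaly step described in the abstract can be sketched with a toy stand-in for the flow. The sketch below assumes scalar representations and replaces the bidirectional normalizing flow with a simple invertible affine map; the function names (fit_affine_flow, normalize, generate, make_pseudo_anomalies) and the noise_std parameter are hypothetical and are not taken from the authors' code.

```python
import random
import statistics

def fit_affine_flow(xs):
    """Fit a toy invertible map (mean, std) on normal-only representations."""
    return statistics.fmean(xs), statistics.pstdev(xs)

def normalize(x, flow):
    """Normalizing direction: map a representation toward a standard normal."""
    mu, sigma = flow
    return (x - mu) / sigma

def generate(z, flow):
    """Generation direction: exact inverse of normalize()."""
    mu, sigma = flow
    return z * sigma + mu

def make_pseudo_anomalies(xs, flow, noise_std, rng):
    """Perturb normalized representations with Gaussian noise, then map back."""
    return [generate(normalize(x, flow) + rng.gauss(0.0, noise_std), flow) for x in xs]

rng = random.Random(0)
normal_reprs = [rng.gauss(5.0, 2.0) for _ in range(200)]  # stand-in for encoder outputs
flow = fit_affine_flow(normal_reprs)
pseudo_anomalies = make_pseudo_anomalies(normal_reprs, flow, noise_std=3.0, rng=rng)
```

A classifier would then be trained to separate normal_reprs from pseudo_anomalies; a real normalizing flow is a learned nonlinear invertible network, so this affine map only illustrates the normalize, perturb, generate pattern.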

Title: Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI

Authors: Dong Shu, Zhouyao Zhu
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2403.10559
Pdf URL: https://arxiv.org/pdf/2403.10559
Copy Paste: [[2403.10559]] Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI(https://arxiv.org/abs/2403.10559)
Keywords: generative
Abstract: This report investigates the history and impact of Generative Models and Connected and Automated Vehicles (CAVs), two groundbreaking forces pushing progress in technology and transportation. By focusing on the application of generative models within the context of CAVs, the study aims to unravel how this integration could enhance predictive modeling, simulation accuracy, and decision-making processes in autonomous vehicles. This thesis discusses the benefits and challenges of integrating generative models and CAV technology in transportation. It aims to highlight the progress made, the remaining obstacles, and the potential for advancements in safety and innovation.

Title: Cooling-Guide Diffusion Model for Battery Cell Arrangement

Authors: Nicholas Sung, Liu Zheng, Pingfeng Wang, Faez Ahmed
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10566
Pdf URL: https://arxiv.org/pdf/2403.10566
Copy Paste: [[2403.10566]] Cooling-Guide Diffusion Model for Battery Cell Arrangement(https://arxiv.org/abs/2403.10566)
Keywords: diffusion, generative
Abstract: Our study introduces a Generative AI method that employs a cooling-guided diffusion model to optimize the layout of battery cells, a crucial step for enhancing the cooling performance and efficiency of battery thermal management systems. Traditional design processes, which rely heavily on iterative optimization and extensive guesswork, are notoriously slow and inefficient, often leading to suboptimal solutions. In contrast, our innovative method uses a parametric denoising diffusion probabilistic model (DDPM) with classifier and cooling guidance to generate optimized cell layouts with enhanced cooling paths, significantly lowering the maximum temperature of the cells. By incorporating position-based classifier guidance, we ensure the feasibility of generated layouts. Meanwhile, cooling guidance directly optimizes cooling-efficiency, making our approach uniquely effective. When compared to two advanced models, the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) and the Conditional Tabular GAN (CTGAN), our cooling-guided diffusion model notably outperforms both. It is five times more effective than TabDDPM and sixty-six times better than CTGAN across key metrics such as feasibility, diversity, and cooling efficiency. This research marks a significant leap forward in the field, aiming to optimize battery cell layouts for superior cooling efficiency, thus setting the stage for the development of more effective and dependable battery thermal management systems.

Title: MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts

Authors: Ruixiang Jiang, Lingbo Liu, Changwen Chen
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.10568
Pdf URL: https://arxiv.org/pdf/2403.10568
Copy Paste: [[2403.10568]] MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts(https://arxiv.org/abs/2403.10568)
Keywords: foundation model
Abstract: Prompt-tuning has demonstrated parameter-efficiency in fusing unimodal foundation models for multimodal tasks. However, its limited adaptivity and expressiveness lead to suboptimal performance when compared with other tuning methods. In this paper, we address this issue by disentangling the vanilla prompts to adaptively capture dataset-level and instance-level features. Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness. MoPE leverages multimodal pairing priors to route the most effective prompt on a per-instance basis. Compared to vanilla prompting, our MoPE-based conditional prompting exhibits greater expressiveness for multimodal fusion, scaling better with the training data and the overall number of trainable parameters. We also study a regularization term for expert routing, leading to emergent expert specialization, where different experts focus on different concepts, enabling interpretable soft prompting. Extensive experiments across three multimodal datasets demonstrate that our method achieves state-of-the-art results, matching or even surpassing the performance of fine-tuning, while requiring only 0.8% of the trainable parameters. Code will be released: https://github.com/songrise/MoPE.

Title: Symbiotic Game and Foundation Models for Cyber Deception Operations in Strategic Cyber Warfare

Authors: Tao Li, Quanyan Zhu
Subjects: cs.CR, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2403.10570
Pdf URL: https://arxiv.org/pdf/2403.10570
Copy Paste: [[2403.10570]] Symbiotic Game and Foundation Models for Cyber Deception Operations in Strategic Cyber Warfare(https://arxiv.org/abs/2403.10570)
Keywords: foundation model
Abstract: We are currently facing unprecedented cyber warfare with the rapid evolution of tactics, increasing asymmetry of intelligence, and the growing accessibility of hacking tools. In this landscape, cyber deception emerges as a critical component of our defense strategy against increasingly sophisticated attacks. This chapter aims to highlight the pivotal role of game-theoretic models and foundation models (FMs) in analyzing, designing, and implementing cyber deception tactics. Game models (GMs) serve as a foundational framework for modeling diverse adversarial interactions, allowing us to encapsulate both adversarial knowledge and domain-specific insights. Meanwhile, FMs serve as the building blocks for creating tailored machine learning models suited to given applications. By leveraging the synergy between GMs and FMs, we can advance proactive and automated cyber defense mechanisms by not only securing our networks against attacks but also enhancing their resilience against well-planned operations. This chapter discusses the games at the tactical, operational, and strategic levels of warfare, delves into the symbiotic relationship between these methodologies, and explores relevant applications where such a framework can make a substantial impact in cybersecurity. The chapter discusses the promising direction of the multi-agent neurosymbolic conjectural learning (MANSCOL), which allows the defender to predict adversarial behaviors, design adaptive defensive deception tactics, and synthesize knowledge for the operational level synthesis and adaptation. FMs serve as pivotal tools across various functions for MANSCOL, including reinforcement learning, knowledge assimilation, formation of conjectures, and contextual representation. This chapter concludes with a discussion of the challenges associated with FMs and their application in the domain of cybersecurity.

Title: Neural Erosion: Emulating Controlled Neurodegeneration and Aging in AI Systems

Authors: Antonios Alexos, Yu-Dai Tsai, Ian Domingo, Maryam Pishgar, Pierre Baldi
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2403.10596
Pdf URL: https://arxiv.org/pdf/2403.10596
Copy Paste: [[2403.10596]] Neural Erosion: Emulating Controlled Neurodegeneration and Aging in AI Systems(https://arxiv.org/abs/2403.10596)
Keywords: generative
Abstract: Creating controlled methods to simulate neurodegeneration in artificial intelligence (AI) is crucial for applications that emulate brain function decline and cognitive disorders. We use IQ tests performed by Large Language Models (LLMs) and, more specifically, LLaMA 2, to introduce the concept of "neural erosion." This deliberate erosion involves ablating synapses or neurons, or adding Gaussian noise during or after training, resulting in a controlled progressive decline in the LLMs' performance. We are able to describe the neurodegeneration in the IQ tests and show that the LLM first loses its mathematical abilities and then its linguistic abilities, while further losing its ability to understand the questions. To the best of our knowledge, this is the first work that models neurodegeneration with text data, compared to other works that operate in the computer vision domain. Finally, we draw similarities between our study and cognitive decline clinical studies involving test subjects. We find that with the application of neurodegenerative methods, LLMs lose abstract thinking abilities, followed by mathematical degradation, and ultimately, a loss in linguistic ability, responding to prompts incoherently. These findings are in accordance with human studies.

Title: LightIt: Illumination Modeling and Control for Diffusion Models

Authors: Peter Kocsis (1), Julien Philip (2), Kalyan Sunkavalli (2), Matthias Nießner (1), Yannick Hold-Geoffroy (2) ((1) Technical University of Munich, (2) Adobe Research)
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10615
Pdf URL: https://arxiv.org/pdf/2403.10615
Copy Paste: [[2403.10615]] LightIt: Illumination Modeling and Control for Diffusion Models(https://arxiv.org/abs/2403.10615)
Keywords: diffusion, generative
Abstract: We introduce LightIt, a method for explicit illumination control for image generation. Recent generative methods lack lighting control, which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations, we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading, which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then, we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally, we use our generated dataset to train an identity-preserving relighting model, conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable, consistent lighting and performs on par with specialized relighting state-of-the-art methods.

Title: IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Authors: Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10701
Pdf URL: https://arxiv.org/pdf/2403.10701
Copy Paste: [[2403.10701]] IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation(https://arxiv.org/abs/2403.10701)
Keywords: diffusion, generative
Abstract: Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality.

Title: Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation

Authors: Anton Pelykh, Ozge Mercanoglu Sincan, Richard Bowden
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10731
Pdf URL: https://arxiv.org/pdf/2403.10731
Copy Paste: [[2403.10731]] Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation(https://arxiv.org/abs/2403.10731)
Keywords: diffusion
Abstract: Recent years have seen significant progress in human image generation, particularly with the advancements in diffusion models. However, existing diffusion methods encounter challenges when producing consistent hand anatomy and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body out-painting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. A novel blending technique is introduced to preserve the hand details during the second stage that combines the results of both stages in a coherent way. This involves sequential expansion of the out-painted region while fusing the latent representations, to ensure a seamless and cohesive synthesis of the final image. Experimental evaluations demonstrate the superiority of our proposed method over state-of-the-art techniques, in both pose accuracy and image quality, as validated on the HaGRID dataset. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code of the proposed approach is available at https://github.com/apelykh/hand-to-diffusion.

Title: StableGarment: Garment-Centric Generation via Stable Diffusion

Authors: Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, Peipei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10783
Pdf URL: https://arxiv.org/pdf/2403.10783
Copy Paste: [[2403.10783]] StableGarment: Garment-Centric Generation via Stable Diffusion(https://arxiv.org/abs/2403.10783)
Keywords: diffusion
Abstract: In this paper, we introduce StableGarment, a unified framework to tackle garment-centric (GC) generation tasks, including GC text-to-image, controllable GC text-to-image, stylized GC text-to-image, and robust virtual try-on. The main challenge lies in retaining the intricate textures of the garment while maintaining the flexibility of pre-trained Stable Diffusion. Our solution involves the development of a garment encoder, a trainable copy of the denoising UNet equipped with additive self-attention (ASA) layers. These ASA layers are specifically devised to transfer detailed garment textures, also facilitating the integration of stylized base models for the creation of stylized images. Furthermore, the incorporation of a dedicated try-on ControlNet enables StableGarment to execute virtual try-on tasks with precision. We also build a novel data engine that produces high-quality synthesized data to preserve the model's ability to follow prompts. Extensive experiments demonstrate that our approach delivers state-of-the-art (SOTA) results among existing virtual try-on methods and exhibits high flexibility with broad potential applications in various garment-centric image generation.

Title: Time Series Representation Learning with Supervised Contrastive Temporal Transformer

Authors: Yuansan Liu, Sudanthi Wijewickrema, Christofer Bester, Stephen O'Leary, James Bailey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10787
Pdf URL: https://arxiv.org/pdf/2403.10787
Copy Paste: [[2403.10787]] Time Series Representation Learning with Supervised Contrastive Temporal Transformer(https://arxiv.org/abs/2403.10787)
Keywords: self-supervised
Abstract: Finding effective representations for time series data is a useful but challenging task. Several works utilize self-supervised or unsupervised learning methods to address this. However, there still remains the open question of how to leverage available label information for better representations. To answer this question, we exploit pre-existing techniques in time series and representation learning domains and develop a simple, yet novel fusion model, called Supervised COntrastive Temporal Transformer (SCOTT). We first investigate suitable augmentation methods for various types of time series data to assist with learning change-invariant representations. Secondly, we combine Transformer and Temporal Convolutional Networks in a simple way to efficiently learn both global and local features. Finally, we simplify Supervised Contrastive Loss for representation learning of labelled time series data. We preliminarily evaluate SCOTT on a downstream task, Time Series Classification, using 45 datasets from the UCR archive. The results show that with the representations learnt by SCOTT, even a weak classifier can perform similar to or better than existing state-of-the-art models (best performance on 23/45 datasets and highest rank against 9 baseline models). Afterwards, we investigate SCOTT's ability to address a real-world task, online Change Point Detection (CPD), on two datasets: a human activity dataset and a surgical patient dataset. We show that the model performs with high reliability and efficiency on the online CPD problem (~98% and ~97% area under precision-recall curve respectively). Furthermore, we demonstrate the model's potential in tackling early detection and show it performs best compared to other candidates.
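
Illustrative note (not from the paper): the abstract mentions a simplified Supervised Contrastive Loss but does not give its form. As background, the sketch below implements the standard supervised contrastive loss, in which each anchor is pulled toward other samples with the same label and pushed away from the rest. It is an assumption that this is the relevant starting point; the function name supcon_loss and the temperature default are hypothetical, and SCOTT's simplified variant may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.5):
    """Standard supervised contrastive loss, averaged over anchors that have positives."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner contributes nothing
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

embeddings = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
aligned = supcon_loss(embeddings, [0, 0, 1, 1])      # labels match the clusters
misaligned = supcon_loss(embeddings, [0, 1, 0, 1])   # labels cut across the clusters
```

The loss is lower when same-label embeddings already sit close together (aligned) than when labels cut across the clusters (misaligned), which is the signal that drives label-aware representation learning.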

Title: Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Authors: Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Haoye Dong, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10799
Pdf URL: https://arxiv.org/pdf/2403.10799
Copy Paste: [[2403.10799]] Efficient Pruning of Large Language Model with Adaptive Estimation Fusion(https://arxiv.org/abs/2403.10799)
Keywords: generative
Abstract: Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

Title: Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples

Authors: Ziqi Zhou, Minghui Li, Wei Liu, Shengshan Hu, Yechao Zhang, Wei Wan, Lulu Xue, Leo Yu Zhang, Dezhong Yang, Hai Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10801
Pdf URL: https://arxiv.org/pdf/2403.10801
Copy Paste: [[2403.10801]] Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples(https://arxiv.org/abs/2403.10801)
Keywords: self-supervised
Abstract: With the evolution of self-supervised learning, the pre-training paradigm has emerged as a predominant solution within the deep learning landscape. Model providers furnish pre-trained encoders designed to function as versatile feature extractors, enabling downstream users to harness the benefits of expansive models with minimal effort through fine-tuning. Nevertheless, recent works have exposed a vulnerability in pre-trained encoders, highlighting their susceptibility to downstream-agnostic adversarial examples (DAEs) meticulously crafted by attackers. The lingering question pertains to the feasibility of fortifying the robustness of downstream models against DAEs, particularly in scenarios where the pre-trained encoders are publicly accessible to the attackers. In this paper, we initially delve into existing defensive mechanisms against adversarial examples within the pre-training paradigm. Our findings reveal that the failure of current defenses stems from the domain shift between pre-training data and downstream tasks, as well as the sensitivity of encoder parameters. In response to these challenges, we propose Genetic Evolution-Nurtured Adversarial Fine-tuning (Gen-AF), a two-stage adversarial fine-tuning approach aimed at enhancing the robustness of downstream models. Our extensive experiments, conducted across ten self-supervised training methods and six datasets, demonstrate that Gen-AF attains high testing accuracy and robust testing accuracy against state-of-the-art DAEs.

Title: Anomaly Detection Based on Isolation Mechanisms: A Survey

Authors: Yang Cao, Haolong Xiang, Hang Zhang, Ye Zhu, Kai Ming Ting
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.10802
Pdf URL: https://arxiv.org/pdf/2403.10802
Copy Paste: [[2403.10802]] Anomaly Detection Based on Isolation Mechanisms: A Survey(https://arxiv.org/abs/2403.10802)
Keywords: anomaly
Abstract: Anomaly detection is a longstanding and active research area that has many applications in domains such as finance, security, and manufacturing. However, the efficiency and performance of anomaly detection algorithms are challenged by the large-scale, high-dimensional, and heterogeneous data that are prevalent in the era of big data. Isolation-based unsupervised anomaly detection is a novel and effective approach for identifying anomalies in data. It relies on the idea that anomalies are few and different from normal instances, and thus can be easily isolated by random partitioning. Isolation-based methods have several advantages over existing methods, such as low computational complexity, low memory usage, high scalability, robustness to noise and irrelevant features, and no need for prior knowledge or heavy parameter tuning. In this survey, we review the state-of-the-art isolation-based anomaly detection methods, including their data partitioning strategies, anomaly score functions, and algorithmic details. We also discuss some extensions and applications of isolation-based methods in different scenarios, such as detecting anomalies in streaming data, time series, trajectory, and image datasets. Finally, we identify some open challenges and future directions for isolation-based anomaly detection research.
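
Illustrative note (not from the survey): the isolation idea in the abstract, that anomalies are few and different and so are separated by fewer random partitions, can be sketched as follows. This is a minimal sketch that splits the full dataset rather than subsamples, as a production isolation forest would; the function names and the n_trees, max_depth, and seed parameters are hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=20):
    """Count the random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # cannot split further along this dimension
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random partitionings; shorter means more anomalous."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
outlier = (8.0, 8.0)
data = normal + [outlier]
central = min(normal, key=lambda p: p[0] ** 2 + p[1] ** 2)  # most typical inlier
outlier_depth = mean_path_length(outlier, data)
central_depth = mean_path_length(central, data)
```

Here outlier_depth comes out smaller than central_depth, because a random cut is much more likely to fall in the empty space between the outlier and the cluster than between two neighbouring inliers.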

Title: Active Label Correction for Semantic Segmentation with Foundation Models

Authors: Hoyoung Kim, Sehyun Hwang, Suha Kwak, Jungseul Ok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10820
Pdf URL: https://arxiv.org/pdf/2403.10820
Copy Paste: [[2403.10820]] Active Label Correction for Semantic Segmentation with Foundation Models(https://arxiv.org/abs/2403.10820)
Keywords: foundation model
Abstract: Training and validating models for semantic segmentation require datasets with pixel-wise annotations, which are notoriously labor-intensive. Although useful priors such as foundation models or crowdsourced datasets are available, they are error-prone. We hence propose an effective framework of active label correction (ALC) based on a design of correction query to rectify pseudo labels of pixels, which in turn is more annotator-friendly than the standard one inquiring to classify a pixel directly according to our theoretical analysis and user study. Specifically, leveraging foundation models providing useful zero-shot predictions on pseudo labels and superpixels, our method comprises two key techniques: (i) an annotator-friendly design of correction query with the pseudo labels, and (ii) an acquisition function looking ahead label expansions based on the superpixels. Experimental results on PASCAL, Cityscapes, and Kvasir-SEG datasets demonstrate the effectiveness of our ALC framework, outperforming prior methods for active semantic segmentation and label correction. Notably, utilizing our method, we obtained a revised dataset of PASCAL by rectifying errors in 2.6 million pixels in PASCAL dataset.

Title: VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis

Authors: Hao Wei, Bowen Liu, Minqing Zhang, Peilun Shi, Wu Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10823
Pdf URL: https://arxiv.org/pdf/2403.10823
Copy Paste: [[2403.10823]] VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis(https://arxiv.org/abs/2403.10823)
Keywords: foundation model
Abstract: Generalist foundation models have ushered in newfound capabilities in the medical domain. However, the contradiction between the growing demand for high-quality annotated data and patient privacy continues to intensify. The utilization of medical artificial intelligence generated content (Med-AIGC) as an inexhaustible resource repository arises as a potential solution to address the aforementioned challenge. Here we harness 1 million open-source synthetic fundus images paired with natural language descriptions, to curate an ethical language-image foundation model for retina image analysis named VisionCLIP. VisionCLIP achieves competitive performance on three external datasets compared with the existing method pre-trained on real-world data in a zero-shot fashion. The employment of artificially synthetic images alongside corresponding textual data for training enables the medical foundation model to successfully assimilate knowledge of disease symptomatology, thereby circumventing potential breaches of patient confidentiality.

Title: DUE: Dynamic Uncertainty-Aware Explanation Supervision via 3D Imputation

Authors: Qilong Zhao, Yifei Zhang, Mengdan Zhu, Siyi Gu, Yuyang Gao, Xiaofeng Yang, Liang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10831
Pdf URL: https://arxiv.org/pdf/2403.10831
Copy Paste: [[2403.10831]] DUE: Dynamic Uncertainty-Aware Explanation Supervision via 3D Imputation(https://arxiv.org/abs/2403.10831)
Keywords: diffusion
Abstract: Explanation supervision aims to enhance deep learning models by integrating additional signals to guide the generation of model explanations, showcasing notable improvements in both the predictability and explainability of the model. However, the application of explanation supervision to higher-dimensional data, such as 3D medical images, remains an under-explored domain. Challenges associated with supervising visual explanations in the presence of an additional dimension include: 1) changed spatial correlations, 2) a lack of direct 3D annotations, and 3) uncertainty that varies across different parts of the explanation. To address these challenges, we propose a Dynamic Uncertainty-aware Explanation supervision (DUE) framework for 3D explanation supervision that ensures uncertainty-aware explanation guidance when dealing with sparsely annotated 3D data with diffusion-based 3D interpolation. Our proposed framework is validated through comprehensive experiments on diverse real-world medical imaging datasets. The results demonstrate the effectiveness of our framework in enhancing the predictability and explainability of deep learning models in the context of medical imaging diagnosis applications.

Title: Just Say the Name: Online Continual Learning with Category Names Only via Data Generation

Authors: Minhyuk Seo, Diganta Misra, Seongwon Cho, Minjae Lee, Jonghyun Choi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2403.10853
Pdf URL: https://arxiv.org/pdf/2403.10853
Copy Paste: [[2403.10853]] Just Say the Name: Online Continual Learning with Category Names Only via Data Generation(https://arxiv.org/abs/2403.10853)
Keywords: generative
Abstract: In real-world scenarios, extensive manual annotation for continual learning is impractical due to prohibitive costs. Although prior arts, influenced by large-scale webly supervised training, suggest leveraging web-scraped data in continual learning, this poses challenges such as data imbalance, usage restrictions, and privacy concerns. Addressing the risks of continual webly supervised training, we present an online continual learning framework - Generative Name only Continual Learning (G-NoCL). The proposed G-NoCL uses a set of generators G along with the learner. When encountering new concepts (i.e., classes), G-NoCL employs the novel sample complexity-guided data ensembling technique DIverSity and COmplexity enhancing ensemBlER (DISCOBER) to optimally sample training data from generated data. Through extensive experimentation, we demonstrate superior performance of DISCOBER in G-NoCL online CL benchmarks, covering both In-Distribution (ID) and Out-of-Distribution (OOD) generalization evaluations, compared to naive generator-ensembling, web-supervised, and manually annotated data.

Title: A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Authors: Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10854
Pdf URL: https://arxiv.org/pdf/2403.10854
Copy Paste: [[2403.10854]] A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment(https://arxiv.org/abs/2403.10854)
Keywords: in-context
Abstract: While Multimodal Large Language Models (MLLMs) have advanced significantly in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one closed-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly.
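
The nine prompting systems mentioned in the abstract are the Cartesian product of three testing procedures and three prompting strategies. The sketch below only enumerates those combinations; the string labels and the function name `prompting_systems` are illustrative assumptions, and no actual prompts from the paper are reproduced.

```python
from itertools import product

# Procedure and strategy names come from the abstract; the exact labels are illustrative.
PROCEDURES = ["single-stimulus", "double-stimulus", "multiple-stimulus"]
STRATEGIES = ["standard", "in-context", "chain-of-thought"]


def prompting_systems():
    """Return every (procedure, strategy) pair as a list of tuples."""
    return list(product(PROCEDURES, STRATEGIES))


systems = prompting_systems()
for procedure, strategy in systems:
    print(f"{procedure} + {strategy}")
```

Enumerating the grid this way makes it straightforward to evaluate each model under every combination and then keep the best-performing one per model, which is how the abstract describes selecting the "respective optimal prompting systems".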

Title: Zero-shot Generative Linguistic Steganography

Authors: Ke Lin, Yiyang Luo, Zijian Zhang, Ping Luo
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10856
Pdf URL: https://arxiv.org/pdf/2403.10856
Copy Paste: [[2403.10856]] Zero-shot Generative Linguistic Steganography(https://arxiv.org/abs/2403.10856)
Keywords: generative, in-context
Abstract: Generative linguistic steganography attempts to hide secret messages in covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext; however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces $1.926\times$ more innocent and intelligible stegotext than any other method.

Title: A Watermark-Conditioned Diffusion Model for IP Protection

Authors: Rui Min, Sen Li, Hongyang Chen, Minhao Cheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.10893
Pdf URL: https://arxiv.org/pdf/2403.10893
Copy Paste: [[2403.10893]] A Watermark-Conditioned Diffusion Model for IP Protection(https://arxiv.org/abs/2403.10893)
Keywords: diffusion, generative
Abstract: The ethical need to protect AI-generated content has been a significant concern in recent years. While existing watermarking strategies have demonstrated success in detecting synthetic content (detection), there has been limited exploration in identifying the users responsible for generating these outputs from a single model (owner identification). In this paper, we focus on both practical scenarios and propose a unified watermarking framework for content copyright protection within the context of diffusion models. Specifically, we consider two parties: the model provider, who grants public access to a diffusion model via an API, and the users, who can solely query the model API and generate images in a black-box manner. Our task is to embed hidden information into the generated content, which facilitates further detection and owner identification. To tackle this challenge, we propose a Watermark-conditioned Diffusion model called WaDiff, which manipulates the watermark as a conditioned input and incorporates fingerprinting into the generation process. All outputs generated by WaDiff carry user-specific information, which can be recovered by an image extractor and further facilitate forensic identification. Extensive experiments are conducted on two popular diffusion models, and we demonstrate that our method is effective and robust in both the detection and owner identification tasks. Meanwhile, our watermarking framework only exerts a negligible impact on the original generation and is more stealthy and efficient in comparison to existing watermarking strategies.

Title: DTOR: Decision Tree Outlier Regressor to explain anomalies

Authors: Riccardo Crupi, Alessandro Damiano Sabatino, Immacolata Marano, Massimiliano Brinis, Luca Albertazzi, Andrea Cirillo, Andrea Claudio Cosentini
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2403.10903
Pdf URL: https://arxiv.org/pdf/2403.10903
Copy Paste: [[2403.10903]] DTOR: Decision Tree Outlier Regressor to explain anomalies(https://arxiv.org/abs/2403.10903)
Keywords: anomaly
Abstract: Explaining the occurrence of outliers and the mechanisms behind them can be extremely important in a variety of domains. Malfunctions, frauds, and threats, in addition to being correctly identified, often need a valid explanation so that actionable countermeasures can be taken effectively. The ever more widespread use of sophisticated Machine Learning approaches to identify anomalies makes such explanations more challenging. We present the Decision Tree Outlier Regressor (DTOR), a technique for producing rule-based explanations for individual data points by estimating anomaly scores generated by an anomaly detection model. This is accomplished by first applying a Decision Tree Regressor, which computes the estimation score, and then extracting the relative path associated with the data point score. Our results demonstrate the robustness of DTOR even in datasets with a large number of features. Additionally, in contrast to other rule-based approaches, the generated rules are consistently satisfied by the points to be explained. Furthermore, our evaluation metrics indicate comparable performance to Anchors in outlier explanation tasks, with reduced execution time.

Title: Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

Authors: Yeongtak Oh, Jonghyun Lee, Jooyoung Choi, Dahuin Jung, Uiwon Hwang, Sungroh Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10911
Pdf URL: https://arxiv.org/pdf/2403.10911
Copy Paste: [[2403.10911]] Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation(https://arxiv.org/abs/2403.10911)
Keywords: diffusion
Abstract: Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance, memory consumption, and time consumption are all crucial considerations. A recent diffusion-based TTA approach for restoring corrupted images involves image-level updates. However, using pixel space diffusion significantly increases resource requirements compared to conventional model updating TTA approaches, revealing limitations as a TTA method. To address this, we propose a novel TTA method by leveraging a latent diffusion model (LDM) based image editing model and fine-tuning it with our newly introduced corruption modeling scheme. This scheme enhances the robustness of the diffusion model against distribution shifts by creating (clean, corrupted) image pairs and fine-tuning the model to edit corrupted images into clean ones. Moreover, we introduce a distilled variant to accelerate the model for corruption editing using only 4 network function evaluations (NFEs). We extensively validated our method across various architectures and datasets including image and video domains. Our model achieves the best performance with a 100 times faster runtime than that of a diffusion-based baseline. Furthermore, it is three times faster than the model updating TTA method based on data augmentation, rendering an image-level updating approach more practical.

Title: Interpretable Machine Learning for TabPFN

Authors: David Rundel, Julius Kobialka, Constantin von Crailsheim, Matthias Feurer, Thomas Nagler, David Rügamer
Subjects: cs.LG, cs.AI, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2403.10923
Pdf URL: https://arxiv.org/pdf/2403.10923
Copy Paste: [[2403.10923]] Interpretable Machine Learning for TabPFN(https://arxiv.org/abs/2403.10923)
Keywords: in-context
Abstract: The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN. Our proposed methods are implemented in a package tabpfn_iml and made available at https://github.com/david-rundel/tabpfn_iml.
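
The abstract's point about Leave-One-Covariate-Out (LOCO) can be illustrated with any predictor that conditions on a context set at prediction time instead of being trained. The sketch below is an assumption-laden stand-in: it uses a tiny k-nearest-neighbour classifier in place of TabPFN, a toy dataset, and hypothetical function names (`knn_predict`, `loco_importance`). It is not the tabpfn_iml API; it only shows that "refitting" without a covariate reduces to dropping that column from the context and the test rows.

```python
def knn_predict(context_X, context_y, x, k=3):
    """Stand-in in-context predictor: majority label among the k nearest context rows."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(context_X, context_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


def error_rate(context_X, context_y, test_X, test_y, k=3):
    """Fraction of test rows the stand-in predictor gets wrong."""
    wrong = sum(
        knn_predict(context_X, context_y, x, k) != y for x, y in zip(test_X, test_y)
    )
    return wrong / len(test_y)


def drop_column(X, j):
    """Return a copy of X with covariate j removed from every row."""
    return [row[:j] + row[j + 1:] for row in X]


def loco_importance(context_X, context_y, test_X, test_y, k=3):
    """LOCO score per covariate: error without the covariate minus error with all covariates."""
    base = error_rate(context_X, context_y, test_X, test_y, k)
    scores = []
    for j in range(len(context_X[0])):
        reduced = error_rate(
            drop_column(context_X, j), context_y, drop_column(test_X, j), test_y, k
        )
        scores.append(reduced - base)
    return scores


# Toy data (hypothetical): covariate 0 separates the classes, covariate 1 is noise.
context_X = [[0, 2], [1, 0], [2, 3], [10, 1], [11, 3], [9, 0]]
context_y = [0, 0, 0, 1, 1, 1]
test_X = [[1, 1], [0, 3], [10, 2], [11, 0]]
test_y = [0, 0, 1, 1]

scores = loco_importance(context_X, context_y, test_X, test_y)
print(scores)
```

On this toy data, removing the informative covariate raises the error while removing the noise covariate leaves it unchanged, so the first score is larger than the second. No retraining loop appears anywhere, which is the efficiency argument the abstract makes for in-context learners.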

Title: ScanTalk: 3D Talking Heads from Unregistered Scans

Authors: Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Stefano Berretti, Mohamed Daoudi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10942
Pdf URL: https://arxiv.org/pdf/2403.10942
Copy Paste: [[2403.10942]] ScanTalk: 3D Talking Heads from Unregistered Scans(https://arxiv.org/abs/2403.10942)
Keywords: diffusion
Abstract: Speech-driven 3D talking heads generation has emerged as a significant area of interest among researchers, presenting numerous challenges. Existing methods are constrained by animating faces with fixed topologies, wherein point-wise correspondence is established, and the number and order of points remain consistent across all identities the model can animate. In this work, we present ScanTalk, a novel framework capable of animating 3D faces in arbitrary topologies including scanned data. Our approach relies on the DiffusionNet architecture to overcome the fixed topology constraint, offering promising avenues for more flexible and realistic 3D animations. By leveraging the power of DiffusionNet, ScanTalk not only adapts to diverse facial structures but also maintains fidelity when dealing with scanned data, thereby enhancing the authenticity and versatility of generated 3D talking heads. Through comprehensive comparisons with state-of-the-art methods, we validate the efficacy of our approach, demonstrating its capacity to generate realistic talking heads comparable to existing techniques. While our primary objective is to develop a generic method free from topological constraints, all state-of-the-art methodologies are bound by such limitations. Code for reproducing our results and the pre-trained model will be made available.

Title: Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

Authors: Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10953
Pdf URL: https://arxiv.org/pdf/2403.10953
Copy Paste: [[2403.10953]] Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription(https://arxiv.org/abs/2403.10953)
Keywords: diffusion
Abstract: Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D reconstruction. We realize that such inconsistency is largely due to the fact that it is difficult to enforce accurate pose and appearance alignment directly in the diffusion training, as mostly done by existing methods such as Zero123. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks of NVS and 3D reconstruction, achieving significant improvements in both multiview-consistency and pose-consistency over existing methods.

Title: Energy-Based Models with Applications to Speech and Language Processing

Authors: Zhijian Ou
Subjects: cs.LG, cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.10961
Pdf URL: https://arxiv.org/pdf/2403.10961
Copy Paste: [[2403.10961]] Energy-Based Models with Applications to Speech and Language Processing(https://arxiv.org/abs/2403.10961)
Keywords: generative
Abstract: Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fixed-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.

Title: Exploiting Topological Prior for Boosting Point Cloud Generation

Authors: Baiyuan Chen
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.10962
Pdf URL: https://arxiv.org/pdf/2403.10962
Copy Paste: [[2403.10962]] Exploiting Topological Prior for Boosting Point Cloud Generation(https://arxiv.org/abs/2403.10962)
Keywords: generative
Abstract: This paper presents an innovative enhancement to the Sphere as Prior Generative Adversarial Network (SP-GAN) model, a state-of-the-art GAN designed for point cloud generation. A novel method is introduced for point cloud generation that elevates the structural integrity and overall quality of the generated point clouds by incorporating topological priors into the training process of the generator. Specifically, this work utilizes the K-means algorithm to segment a point cloud from the repository into clusters and extract centroids, which are then used as priors in the generation process of the SP-GAN. Furthermore, the discriminator component of the SP-GAN utilizes the identical point cloud that contributed the centroids, ensuring a coherent and consistent learning environment. This strategic use of centroids as intuitive guides not only boosts the efficiency of global feature learning but also substantially improves the structural coherence and fidelity of the generated point clouds. By applying the K-means algorithm to generate centroids as the prior, the work intuitively and experimentally demonstrates that such a prior enhances the quality of generated point clouds.
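
The centroid-extraction step described in the abstract (segment a point cloud with K-means, keep the centroids) can be sketched with a plain Lloyd's-algorithm loop. Everything here is an assumption for illustration: the function name `kmeans_centroids`, the deterministic initialisation from the first k points, and the toy eight-point cloud. How the centroids are then fed to SP-GAN as a prior is not specified in the abstract and is not shown.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_centroids(points, k, max_iters=50):
    """Run Lloyd's algorithm and return the k centroids as lists of floats.

    Initialisation uses the first k points so the result is deterministic;
    a real pipeline would more likely use random or k-means++ seeding.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append([sum(axis) / len(members) for axis in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids


# Toy point cloud (hypothetical): two well-separated groups of four points each.
cloud = [
    (0, 0, 0), (10, 10, 10),
    (2, 0, 0), (0, 2, 0), (2, 2, 0),
    (12, 10, 10), (10, 12, 10), (12, 12, 10),
]
centroids = kmeans_centroids(cloud, k=2)
print(centroids)
```

In this toy case each centroid lands at the mean of one group, giving a coarse structural summary of the shape, which is the role the abstract assigns to the centroids.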

Title: Task-Aware Low-Rank Adaptation of Segment Anything Model

Authors: Xuehao Wang, Feiyang Ye, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10971
Pdf URL: https://arxiv.org/pdf/2403.10971
Copy Paste: [[2403.10971]] Task-Aware Low-Rank Adaptation of Segment Anything Model(https://arxiv.org/abs/2403.10971)
Keywords: foundation model
Abstract: The Segment Anything Model (SAM), with its remarkable zero-shot capability, has been proven to be a powerful foundation model for image segmentation, an important task in computer vision. However, the transfer of its rich semantic information to multiple different downstream tasks remains unexplored. In this paper, we propose the Task-Aware Low-Rank Adaptation (TA-LoRA) method, which enables SAM to work as a foundation model for multi-task learning. Specifically, TA-LoRA injects an update parameter tensor into each layer of the encoder in SAM and leverages a low-rank tensor decomposition method to incorporate both task-shared and task-specific information. Furthermore, we introduce modified SAM (mSAM) for multi-task learning where we remove the prompt encoder of SAM and use task-specific no mask embeddings and mask decoder for each task. Extensive experiments conducted on benchmark datasets substantiate the efficacy of TA-LoRA in enhancing the performance of mSAM across multiple downstream tasks.

Title: OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Authors: Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10983
Pdf URL: https://arxiv.org/pdf/2403.10983
Copy Paste: [[2403.10983]] OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models(https://arxiv.org/abs/2403.10983)
Keywords: diffusion
Abstract: Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods struggle with identity preservation, occlusion, and the harmony between foreground and background. In this work, we propose OMG, an occlusion-friendly personalized generation framework designed to seamlessly integrate multiple concepts within a single image. We propose a novel two-stage sampling solution. The first stage takes charge of layout generation and visual comprehension information collection for handling occlusions. The second one utilizes the acquired visual comprehension information and the designed noise blending to integrate multiple concepts while considering occlusions. We also observe that the initiation denoising timestep for noise blending is the key to identity preservation and layout. Moreover, our method can be combined with various single-concept models, such as LoRA and InstantID, without additional tuning. In particular, LoRA models on civitai.com can be used directly. Extensive experiments demonstrate that OMG exhibits superior performance in multi-concept personalization.

Title: Boosting Flow-based Generative Super-Resolution Models via Learned Prior

Authors: Li-Yuan Tsao, Yi-Chen Lo, Chia-Che Chang, Hao-Wei Chen, Roy Tseng, Chien Feng, Chun-Yi Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10988
Pdf URL: https://arxiv.org/pdf/2403.10988
Copy Paste: [[2403.10988]] Boosting Flow-based Generative Super-Resolution Models via Learned Prior(https://arxiv.org/abs/2403.10988)
Keywords: generative
Abstract: Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However, these methods encounter several challenges during image generation, such as grid artifacts, exploding inverses, and suboptimal results due to a fixed sampling temperature. To overcome these issues, this work introduces a conditional learned prior to the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image, which is then transformed by the flow model into an SR image. Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: https://github.com/liyuantsao/FlowSR-LP

Title: Neuro-Symbolic Video Search

Authors: Minkyu Choi, Harsh Goel, Mohammad Omama, Yunhao Yang, Sahil Shah, Sandeep Chinchali
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11021
Pdf URL: https://arxiv.org/pdf/2403.11021
Copy Paste: [[2403.11021]] Neuro-Symbolic Video Search(https://arxiv.org/abs/2403.11021)
Keywords: foundation model
Abstract: The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
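
The decoupling the abstract argues for (per-frame perception first, temporal reasoning second) can be sketched with a two-state machine over per-frame labels. This is a minimal illustration under stated assumptions: the per-frame label sets stand in for the output of a vision-language model, the function name `find_a_then_b` is hypothetical, and the only property checked is "a is observed, and b is observed in some strictly later frame", which is far simpler than the temporal logic formulae the paper refers to.

```python
def find_a_then_b(frames, a, b):
    """Return (index_of_a, index_of_b) for the first match of 'a, then later b'.

    `frames` is a sequence of label sets, one per video frame. The machine starts
    in 'wait_a', moves to 'wait_b' once `a` is seen, and accepts when `b` is seen
    in a strictly later frame. Returns None if the sequence never satisfies this.
    """
    state = "wait_a"
    a_index = None
    for i, labels in enumerate(frames):
        if state == "wait_a" and a in labels:
            state, a_index = "wait_b", i  # remember where the first event happened
        elif state == "wait_b" and b in labels:
            return a_index, i
    return None


# Hypothetical per-frame detections, as a perception model might emit them.
frames = [
    {"car"},
    {"pedestrian"},
    {"car"},
    {"stop sign", "car"},
    {"pedestrian"},
]

print(find_a_then_b(frames, "pedestrian", "stop sign"))
print(find_a_then_b(frames, "stop sign", "truck"))
```

The state variable is the "memory" here: the perception step never needs to look across frames, and the reasoning step never needs to look at pixels.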

Title: Reward Guided Latent Consistency Distillation

Authors: Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11027
Pdf URL: https://arxiv.org/pdf/2403.11027
Copy Paste: [[2403.11027]] Reward Guided Latent Consistency Distillation(https://arxiv.org/abs/2403.11027)
Keywords: diffusion
Abstract: Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM.
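
Two quantities in the abstract lend themselves to a short worked example: the reward-augmented objective and the quoted 25 times acceleration. The sketch below assumes the simplest possible combination, the LCD loss minus a weighted reward, with a weighting coefficient `beta` that is an assumption of this sketch and not a value from the paper; the scalar inputs are placeholders for quantities a real training loop would compute from model outputs.

```python
def rg_lcd_objective(lcd_loss, reward, beta=1.0):
    """Combine a distillation loss with a reward term to be maximized.

    Minimizing (lcd_loss - beta * reward) lowers the distillation loss while
    raising the reward. `beta` is a hypothetical trade-off weight.
    """
    return lcd_loss - beta * reward


def inference_speedup(teacher_steps, student_steps):
    """Ratio of sampling steps, used as a proxy for inference acceleration."""
    return teacher_steps / student_steps


# Placeholder scalars; in practice these would come from the model and the reward model.
objective = rg_lcd_objective(lcd_loss=1.0, reward=0.5, beta=0.5)
speedup = inference_speedup(teacher_steps=50, student_steps=2)
print(objective, speedup)
```

The step ratio of 50 to 2 reproduces the 25 times figure stated in the abstract, under the assumption that each step costs roughly the same for teacher and student.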

Title: Endora: Video Generation Models as Endoscopy Simulators

Authors: Chenxin Li, Hengyu Liu, Yifan Liu, Brandon Y. Feng, Wuyang Li, Xinyu Liu, Zhen Chen, Jing Shao, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11050
Pdf URL: https://arxiv.org/pdf/2403.11050
Copy Paste: [[2403.11050]] Endora: Video Generation Models as Endoscopy Simulators(https://arxiv.org/abs/2403.11050)
Keywords: foundation model, generative
Abstract: Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for machine learning. Despite progress in generating 2D medical images, the complex domain of clinical video generation has largely remained untapped. This paper introduces Endora, an innovative approach to generate medical videos that simulate clinical endoscopy scenes. We present a novel generative model design that integrates a meticulously crafted spatial-temporal video transformer with advanced 2D vision foundation model priors, explicitly modeling spatial-temporal dynamics during video generation. We also pioneer the first public benchmark for endoscopy simulation with video generation models, adapting existing state-of-the-art methods for this endeavor. Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing. Moreover, we explore how this endoscopy simulator can empower downstream video analysis tasks and even generate 3D medical scenes with multi-view consistency. In a nutshell, Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research, setting a substantial stage for further advances in medical content generation. For more details, please visit our project page: https://endora-medvidgen.github.io/.

Title: Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Authors: Jie Ren, Yaxin Li, Shenglai Zen, Han Xu, Lingjuan Lyu, Yue Xing, Jiliang Tang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2403.11052
Pdf URL: https://arxiv.org/pdf/2403.11052
Copy Paste: [[2403.11052]] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention(https://arxiv.org/abs/2403.11052)
Keywords: diffusion
Abstract: Recent advancements in text-to-image diffusion models have demonstrated their remarkable capability to generate high-quality images from textual prompts. However, increasing research indicates that these models memorize and replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. In our study, we provide a novel perspective to understand this memorization phenomenon by examining its relationship with cross-attention mechanisms. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. The diffusion model is overfitted to these token embeddings, memorizing corresponding training images. To elucidate this phenomenon, we further identify and discuss various intrinsic findings of cross-attention that contribute to memorization. Building on these insights, we introduce an innovative approach to detect and mitigate memorization in diffusion models. The advantage of our proposed method is that it will not compromise the speed of either the training or the inference processes in these models while preserving the quality of generated images. Our code is available at https://github.com/renjie3/MemAttn .

Title: Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model

Authors: Kangyang Xie, Binbin Yang, Hao Chen, Meng Wang, Cheng Zou, Hui Xue, Ming Yang, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11077
Pdf URL: https://arxiv.org/pdf/2403.11077
Copy Paste: [[2403.11077]] Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model(https://arxiv.org/abs/2403.11077)
Keywords: diffusion, generative
Abstract: Beyond the superiority of the text-to-image diffusion model in generating high-quality images, recent studies have attempted to uncover its potential for adapting the learned semantic knowledge to visual perception tasks. In this work, instead of translating a generative diffusion model into a visual perception model, we explore retaining the generative ability alongside the perceptive adaptation. To accomplish this, we present Zippo, a unified framework for zipping the color and transparency distributions into a single diffusion model by expanding the diffusion latent into a joint representation of RGB images and alpha mattes. By alternately selecting one modality as the condition and then applying the diffusion process to the counterpart modality, Zippo is capable of generating RGB images from alpha mattes and predicting transparency from input images. In addition to single-modality prediction, we propose a modality-aware noise reassignment strategy to further empower Zippo to jointly generate RGB images and their corresponding alpha mattes under text guidance. Our experiments showcase Zippo's capability for efficient text-conditioned transparent image generation and present plausible results of Matte-to-RGB and RGB-to-Matte translation.

Title: RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning

Authors: Javad Rafiei Asl, Prajwal Panzade, Eduardo Blanco, Daniel Takabi, Zhipeng Cai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11082
Pdf URL: https://arxiv.org/pdf/2403.11082
Copy Paste: [[2403.11082]] RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning(https://arxiv.org/abs/2403.11082)
Keywords: self-supervised
Abstract: Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51% to 38.81%). The framework also yields improvements of 1.59% and 0.23% in semantic textual similarity tasks and various transfer tasks, respectively.

Title: Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

Authors: Xiaohao Xu, Yunkang Cao, Yongqi Chen, Weiming Shen, Xiaonan Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11083
Pdf URL: https://arxiv.org/pdf/2403.11083
Copy Paste: [[2403.11083]] Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning(https://arxiv.org/abs/2403.11083)
Keywords: foundation model, anomaly
Abstract: Anomaly detection is vital in various industrial scenarios, including the identification of unusual patterns in production lines and the detection of manufacturing defects for quality control. Existing techniques tend to be specialized in individual scenarios and lack generalization capacities. In this study, we aim to develop a generic anomaly detection model applicable across multiple scenarios. To achieve this, we customize generic visual-language foundation models that possess extensive knowledge and robust reasoning abilities into anomaly detectors and reasoners. Specifically, we introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models. Our approach considers multi-modal prompt types, including task descriptions, class context, normality rules, and reference images. In addition, we unify the input representation of multi-modality into a 2D image format, enabling multi-modal anomaly detection and reasoning. Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance. The customized models showcase the ability to detect anomalies across different data modalities such as images and point clouds. Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data. Our code is available at https://github.com/Xiaohao-Xu/Customizable-VLM.

Title: Incorporating Higher-order Structural Information for Graph Clustering

Authors: Qiankun Li, Haobing Liu, Ruobing Jiang, Tingting Wang
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2403.11087
Pdf URL: https://arxiv.org/pdf/2403.11087
Copy Paste: [[2403.11087]] Incorporating Higher-order Structural Information for Graph Clustering(https://arxiv.org/abs/2403.11087)
Keywords: self-supervised
Abstract: Clustering holds profound significance in data mining. In recent years, graph convolutional networks (GCNs) have emerged as a powerful tool for deep clustering, integrating both graph structural information and node attributes. However, most existing methods ignore the higher-order structural information of the graph. Evidently, nodes within the same cluster can establish distant connections. Besides, recent deep clustering methods usually apply a self-supervised module to monitor the training process of their model, focusing solely on node attributes without paying attention to graph structure. In this paper, we propose a novel graph clustering network to make full use of graph structural information. To capture the higher-order structural information, we design a graph mutual infomax module, effectively maximizing mutual information between graph-level and node-level representations, and employ a trinary self-supervised module that includes modularity as a structural constraint. Our proposed model outperforms many state-of-the-art methods on various datasets, demonstrating its superiority.
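
Illustration (not from the paper): the abstract above names modularity as a structural constraint. The sketch below computes the standard Newman modularity Q of a hard partition on a small undirected graph, to show what the quantity rewards. The function name, the toy graph, and the two partitions are assumptions made for this example; the paper's actual self-supervised module may use a different, differentiable formulation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label (hard assignment).
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    internal = defaultdict(int)    # edges with both endpoints in the community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum_c [ L_c / m - (d_c / 2m)^2 ]
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
alternating = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

q_triangles = modularity(edges, by_triangle)
q_alternating = modularity(edges, alternating)
print(q_triangles, q_alternating)
```

The partition that follows the two triangles scores higher than the alternating one, which is the behavior a modularity-based constraint is meant to encourage.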

Title: Hierarchical Generative Network for Face Morphing Attacks

Authors: Zuyuan He, Zongyong Deng, Qiaoyun He, Qijun Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11101
Pdf URL: https://arxiv.org/pdf/2403.11101
Copy Paste: [[2403.11101]] Hierarchical Generative Network for Face Morphing Attacks(https://arxiv.org/abs/2403.11101)
Keywords: generative
Abstract: Face morphing attacks circumvent face recognition systems (FRSs) by creating a morphed image that contains multiple identities. However, existing face morphing attack methods either sacrifice image quality or compromise the identity preservation capability. Consequently, these attacks fail to bypass FRS verification well while still managing to deceive human observers. These methods typically rely on global information from contributing images, ignoring the detailed information from effective facial regions. To address the above issues, we propose a novel morphing attack method to improve the quality of morphed images and better preserve the contributing identities. Our proposed method leverages the hierarchical generative network to capture both local detailed and global consistency information. Additionally, a mask-guided image blending module is dedicated to removing artifacts from areas outside the face to improve the image's visual quality. The proposed attack method is compared to state-of-the-art methods on three public datasets in terms of FRSs' vulnerability, attack detectability, and image quality. The results show our method's potential threat of deceiving FRSs while being capable of passing multiple morphing attack detection (MAD) scenarios.

Title: Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Authors: Ruibin Li, Ruihuang Li, Song Guo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11105
Pdf URL: https://arxiv.org/pdf/2403.11105
Copy Paste: [[2403.11105]] Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models(https://arxiv.org/abs/2403.11105)
Keywords: diffusion
Abstract: Text-driven diffusion models have significantly advanced image editing performance by using text prompts as inputs. One crucial step in text-driven image editing is to invert the original image into a latent noise code conditioned on the source prompt. While previous methods have achieved promising results by refactoring the image synthesizing process, the inverted latent noise code is tightly coupled with the source prompt, limiting the image editability by target text prompts. To address this issue, we propose a novel method called Source Prompt Disentangled Inversion (SPDInv), which aims at reducing the impact of the source prompt, thereby enhancing text-driven image editing performance with diffusion models. To make the inverted noise code as independent of the given source prompt as possible, we point out that the iterative inversion process should satisfy a fixed-point constraint. Consequently, we transform the inversion problem into a search problem to find the fixed-point solution, and utilize the pre-trained diffusion models to facilitate the search process. The experimental results show that our proposed SPDInv method can effectively mitigate the conflicts between the target editing prompt and the source prompt, leading to a significant decrease in editing artifacts. In addition to text-driven image editing, with SPDInv we can easily adapt customized image generation models to localized editing tasks and produce promising performance. The source code is available at https://github.com/leeruibin/SPDInv.

Title: Self-Supervised Quantization-Aware Knowledge Distillation

Authors: Kaiqi Zhao, Ming Zhao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2403.11106
Pdf URL: https://arxiv.org/pdf/2403.11106
Copy Paste: [[2403.11106]] Self-Supervised Quantization-Aware Knowledge Distillation(https://arxiv.org/abs/2403.11106)
Keywords: self-supervised
Abstract: Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/SQAKD.git.
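
Illustration (not from the paper): the abstract above describes jointly minimizing a KL loss between the full-precision and low-bit models and a discretization error. The sketch below computes both quantities on toy numbers. The uniform quantizer, the KL direction (full-precision distribution as the reference), the squared-error form of the discretization error, and all names and values are assumptions chosen for illustration, not the SQAKD implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def quantize_uniform(values, bits, lo, hi):
    """Clip to [lo, hi] and snap to one of 2**bits evenly spaced levels."""
    step = (hi - lo) / (2 ** bits - 1)
    quantized = []
    for v in values:
        clipped = min(max(v, lo), hi)
        quantized.append(lo + round((clipped - lo) / step) * step)
    return quantized


def discretization_error(values, quantized):
    """Mean squared difference between original and quantized values."""
    return sum((v - q) ** 2 for v, q in zip(values, quantized)) / len(values)


# Hypothetical logits from a full-precision model and a low-bit model.
full_precision_logits = [2.0, 0.5, -1.0]
low_bit_logits = [1.5, 0.8, -0.7]
kd_loss = kl_divergence(softmax(full_precision_logits),
                        softmax(low_bit_logits))

# Hypothetical weights quantized to 2 bits and to 8 bits.
weights = [-0.9, -0.2, 0.1, 0.7]
weights_2bit = quantize_uniform(weights, bits=2, lo=-1.0, hi=1.0)
error_2bit = discretization_error(weights, weights_2bit)
error_8bit = discretization_error(
    weights, quantize_uniform(weights, bits=8, lo=-1.0, hi=1.0))
print(kd_loss, error_2bit, error_8bit)
```

Neither term uses labels, which matches the abstract's point that the objective needs no label supervision; how the two terms are weighted or combined is left open here.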

Title: Self-supervised co-salient object detection via feature correspondence at multiple scales

Authors: Souradeep Chakraborty, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11107
Pdf URL: https://arxiv.org/pdf/2403.11107
Copy Paste: [[2403.11107]] Self-supervised co-salient object detection via feature correspondence at multiple scales(https://arxiv.org/abs/2403.11107)
Keywords: self-supervised
Abstract: Our paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations. Unlike existing unsupervised methods that rely solely on patch-level information (e.g., clustering patch descriptors) or on computation-heavy off-the-shelf components for CoSOD, our lightweight model leverages feature correspondences at both patch and region levels, significantly improving prediction performance. In the first stage, we train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images. We obtain the segmentation predictions using confidence-based adaptive thresholding. In the next stage, we refine these intermediate segmentations by eliminating the detected regions (within each image) whose averaged feature representations are dissimilar to the foreground feature representation averaged across all the cross-attention maps (from the previous stage). Extensive experiments on three CoSOD benchmark datasets show that our self-supervised model outperforms the corresponding state-of-the-art models by a huge margin (e.g., on the CoCA dataset, our model has a 13.7% F-measure gain over the SOTA unsupervised CoSOD model). Notably, our self-supervised model also outperforms several recent fully supervised CoSOD models on the three test datasets (e.g., on the CoCA dataset, our model has a 4.6% F-measure gain over a recent supervised CoSOD model).

Title: 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

Authors: Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11111
Pdf URL: https://arxiv.org/pdf/2403.11111
Copy Paste: [[2403.11111]] 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models(https://arxiv.org/abs/2403.11111)
Keywords: diffusion, generative
Abstract: In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.

Title: A Versatile Framework for Multi-scene Person Re-identification

Authors: Wei-Shi Zheng, Junkai Yan, Yi-Xing Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11121
Pdf URL: https://arxiv.org/pdf/2403.11121
Copy Paste: [[2403.11121]] A Versatile Framework for Multi-scene Person Re-identification(https://arxiv.org/abs/2403.11121)
Keywords: self-supervised
Abstract: Person Re-identification (ReID) has been extensively developed for a decade in order to learn the association of images of the same person across non-overlapping camera views. To overcome significant variations between images across camera views, numerous variants of ReID models were developed for solving a number of challenges, such as resolution change, clothing change, occlusion, modality change, and so on. Despite the impressive performance of many ReID variants, these variants typically function distinctly and cannot be applied to other challenges. To the best of our knowledge, there is no versatile ReID model that can handle various ReID challenges at the same time. This work contributes the first attempt at learning a versatile ReID model to solve such a problem. Our main idea is to form a two-stage prompt-based twin modeling framework called VersReID. VersReID first leverages the scene label to train a ReID Bank that contains abundant knowledge for handling various scenes, where several groups of scene-specific prompts are used to encode different scene-specific knowledge. In the second stage, we distill a V-Branch model with versatile prompts from the ReID Bank for adaptively solving the ReID of different scenes, eliminating the demand for scene labels during the inference stage. To facilitate training VersReID, we further introduce the multi-scene properties into self-supervised learning of ReID via a multi-scene prioris data augmentation (MPDA) strategy. Through extensive experiments, we demonstrate the success of learning an effective and versatile ReID model for handling ReID tasks under multi-scene conditions without manual assignment of scene labels in the inference stage, including general, low-resolution, clothing change, occlusion, and cross-modality scenes. Codes and models are available at https://github.com/iSEE-Laboratory/VersReID.

Title: Omni-Recon: Towards General-Purpose Neural Radiance Fields for Versatile 3D Applications

Authors: Yonggan Fu, Huaizhi Qu, Zhifan Ye, Chaojian Li, Kevin Zhao, Yingyan Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11131
Pdf URL: https://arxiv.org/pdf/2403.11131
Copy Paste: [[2403.11131]] Omni-Recon: Towards General-Purpose Neural Radiance Fields for Versatile 3D Applications(https://arxiv.org/abs/2403.11131)
Keywords: diffusion, foundation model
Abstract: Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing.

Title: Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model

Authors: Dian Zheng, Xiao-Ming Wu, Shuzhou Yang, Jian Zhang, Jian-Fang Hu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11157
Pdf URL: https://arxiv.org/pdf/2403.11157
Copy Paste: [[2403.11157]] Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model(https://arxiv.org/abs/2403.11157)
Keywords: diffusion
Abstract: Universal image restoration is a practical computer vision task with strong potential for real-world applications. The main challenge of this task is handling the different degradation distributions at once. Existing methods mainly utilize task-specific conditions (e.g., prompts) to guide the model to learn different distributions separately, an approach named multi-partite mapping. However, it is not suitable for universal model learning as it ignores the shared information between different tasks. In this work, we propose an advanced selective hourglass mapping strategy based on a diffusion model, termed DiffUIR. Two novel considerations make our DiffUIR non-trivial. Firstly, we equip the model with strong condition guidance to obtain an accurate generation direction for the diffusion model (selective). More importantly, DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally, which gradually maps different distributions into a shared one. In the reverse process, combined with SDT and strong condition guidance, DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles, by only modifying the mapping strategy, we achieve state-of-the-art performance on five image restoration tasks, 22 benchmarks in the universal setting and zero-shot generalization setting. Surprisingly, even with a lightweight model (only 0.89M), we could achieve outstanding performance. The source code and pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR

Title: CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion

Authors: Xiaoyu Wu, Yang Hua, Chumeng Liang, Jiaru Zhang, Hao Wang, Tao Song, Haibing Guan
Subjects: cs.CV, cs.AI, cs.CR, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11162
Pdf URL: https://arxiv.org/pdf/2403.11162
Copy Paste: [[2403.11162]] CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion(https://arxiv.org/abs/2403.11162)
Keywords: diffusion
Abstract: Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot generation where a pretrained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success, concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response, we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pretrained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image, which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication, surpassing alternative validation techniques. Code implementation is available at https://github.com/Nicholas0228/Revelio.
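
Illustration (not from the paper): the abstract above mentions maximizing an objective with Projected Gradient Descent (PGD). The sketch below shows the generic pattern on a toy problem: take a signed gradient step, then project back into a small box around the starting point. The L-infinity constraint, the signed step, the quadratic stand-in objective, and all names and values are assumptions for this example; CGI-DM's actual objective is a KL divergence between two diffusion models estimated by Monte Carlo sampling.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pgd_maximize(grad_fn, x0, epsilon, step_size, num_steps):
    """Maximize an objective while staying within epsilon of x0 (L-infinity).

    grad_fn: function returning the gradient of the objective at x.
    """
    x = list(x0)
    for _ in range(num_steps):
        grad = grad_fn(x)
        # Ascent step along the sign of the gradient.
        x = [xi + step_size * sign(gi) for xi, gi in zip(x, grad)]
        # Projection: clip each coordinate back into [x0 - eps, x0 + eps].
        x = [min(max(xi, ci - epsilon), ci + epsilon)
             for xi, ci in zip(x, x0)]
    return x


# Hypothetical stand-in objective: negative squared distance to a target.
target = [1.0, 0.05]


def objective(x):
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def objective_grad(x):
    return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]


start = [0.0, 0.0]
result = pgd_maximize(objective_grad, start,
                      epsilon=0.1, step_size=0.02, num_steps=20)
print(result, objective(start), objective(result))
```

The first coordinate is pushed to the edge of the allowed box because its target lies outside it, while the second settles near its target inside the box.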

Title: Artifact Feature Purification for Cross-domain Detection of AI-generated Images

Authors: Zheling Meng, Bo Peng, Jing Dong, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11172
Pdf URL: https://arxiv.org/pdf/2403.11172
Copy Paste: [[2403.11172]] Artifact Feature Purification for Cross-domain Detection of AI-generated Images(https://arxiv.org/abs/2403.11172)
Keywords: diffusion
Abstract: In the era of AIGC, the fast development of visual content generation technologies, such as diffusion models, brings potential security risks to our society. Existing generated image detection methods suffer from a performance drop when faced with out-of-domain generators and image scenes. To relieve this problem, we propose Artifact Purification Network (APN) to facilitate the artifact extraction from generated images through explicit and implicit purification processes. For the explicit one, a suspicious frequency-band proposal method and a spatial feature decomposition method are proposed to extract artifact-related features. For the implicit one, a training strategy based on mutual information estimation is proposed to further purify the artifact-related features. Experiments show that for cross-generator detection, the average accuracy of APN is 5.6% ~ 16.4% higher than the previous 10 methods on GenImage dataset and 1.7% ~ 50.1% on DiffusionForensics dataset. For cross-scene detection, APN maintains its high performance. Via visualization analysis, we find that the proposed method extracts flexible forgery patterns and condenses the forgery information diluted in irrelevant features. We also find that the artifact features APN focuses on across generators and scenes are global and diverse. The code will be available on GitHub.

Title: Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11176
Pdf URL: https://arxiv.org/pdf/2403.11176
Copy Paste: [[2403.11176]] Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment(https://arxiv.org/abs/2403.11176)
Keywords: self-supervised
Abstract: No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on annotated Mean Opinion Scores (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require labeled MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the inherent quality of the images. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts, while guaranteeing consistent representations for images with comparable quality. Our method achieves state-of-the-art performance on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms supervised methods when their training dataset differs from the testing one, thus proving to be more suitable for real-world scenarios. Furthermore, our approach demonstrates greater robustness and improved explainability than competing methods. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP.

Title: usfAD Based Effective Unknown Attack Detection Focused IDS Framework

Authors: Md. Ashraf Uddin, Sunil Aryal, Mohamed Reda Bouadjenek, Muna Al-Hawawreh, Md. Alamin Talukder
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11180
Pdf URL: https://arxiv.org/pdf/2403.11180
Copy Paste: [[2403.11180]] usfAD Based Effective Unknown Attack Detection Focused IDS Framework(https://arxiv.org/abs/2403.11180)
Keywords: anomaly
Abstract: The rapid expansion of varied network systems, including the Internet of Things (IoT) and Industrial Internet of Things (IIoT), has led to an increasing range of cyber threats. Ensuring robust protection against these threats necessitates the implementation of an effective Intrusion Detection System (IDS). For more than a decade, researchers have delved into supervised machine learning techniques to develop IDS to classify normal and attack traffic. However, building effective IDS models using supervised learning requires a substantial number of benign and attack samples. Collecting a sufficient number of attack samples from real-life scenarios is not possible since cyber attacks occur only occasionally. Further, IDS trained and tested on known datasets fail to detect zero-day or unknown attacks due to the swift evolution of attack patterns. To address this challenge, we put forth two strategies for semi-supervised learning based IDS where training samples of attacks are not required: 1) training a supervised machine learning model using randomly and uniformly dispersed synthetic attack samples; 2) building a One Class Classification (OCC) model that is trained exclusively on benign network traffic. We have implemented both approaches and compared their performances using 10 recent benchmark IDS datasets. Our findings demonstrate that the OCC model based on the state-of-the-art anomaly detection technique called usfAD significantly outperforms conventional supervised classification and other OCC based techniques when trained and tested considering real-life scenarios, particularly to detect previously unseen attacks.
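
Illustration (not from the paper): the second strategy in the abstract above is a one-class model trained only on benign traffic. The sketch below shows that training setup with a deliberately simple per-feature z-score detector. It is not usfAD; the class name, the feature meanings, the threshold rule, and the sample values are all assumptions made for this example.

```python
import statistics


class ZScoreOneClassDetector:
    """Toy one-class model: flags rows that lie far from benign training data."""

    def fit(self, benign_rows):
        columns = list(zip(*benign_rows))
        self.means = [statistics.fmean(col) for col in columns]
        # Fall back to 1.0 when a feature is constant so z-scores stay finite.
        self.stds = [statistics.pstdev(col) or 1.0 for col in columns]
        # Threshold: the largest anomaly score seen on the benign rows.
        self.threshold = max(self.score(row) for row in benign_rows)
        return self

    def score(self, row):
        """Largest absolute z-score across the features of one row."""
        return max(abs(x - m) / s
                   for x, m, s in zip(row, self.means, self.stds))

    def is_attack(self, row):
        return self.score(row) > self.threshold


# Hypothetical benign flows: (packets per second, mean inter-arrival seconds).
benign = [
    [100, 0.50],
    [110, 0.60],
    [95, 0.40],
    [105, 0.55],
    [98, 0.45],
]
detector = ZScoreOneClassDetector().fit(benign)

flood_like = [900, 0.01]    # hypothetical, never seen during training
ordinary = [102, 0.50]      # hypothetical, close to the benign profile
print(detector.is_attack(flood_like), detector.is_attack(ordinary))
```

No attack samples are used at any point in fitting, which is the property that lets a one-class model flag traffic patterns it has never seen.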

Title: Self-Supervised Video Desmoking for Laparoscopic Surgery

Authors: Renlong Wu, Zhilu Zhang, Shuohao Zhang, Longfei Gou, Haobin Chen, Lei Zhang, Hao Chen, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11192
Pdf URL: https://arxiv.org/pdf/2403.11192
Copy Paste: [[2403.11192]] Self-Supervised Video Desmoking for Laparoscopic Surgery(https://arxiv.org/abs/2403.11192)
Keywords: self-supervised
Abstract: Due to the difficulty of collecting real paired data, most existing desmoking methods train the models by synthesizing smoke, generalizing poorly to real surgical scenarios. Although a few works have explored single-image real-world desmoking in unpaired learning manners, they still encounter challenges in handling dense smoke. In this work, we address these issues together by introducing the self-supervised surgery video desmoking (SelfSVD). On the one hand, we observe that the frame captured before the activation of high-energy devices is generally clear (named pre-smoke frame, PS frame), so it can serve as supervision for other smoky frames, making real-world self-supervised video desmoking practically feasible. On the other hand, in order to enhance the desmoking performance, we further feed the valuable information from the PS frame into models, where a masking strategy and a regularization term are presented to avoid trivial solutions. In addition, we construct a real surgery video dataset for desmoking, which covers a variety of smoky scenes. Extensive experiments on the dataset show that our SelfSVD can remove smoke more effectively and efficiently while recovering more photo-realistic details than the state-of-the-art methods. The dataset, codes, and pre-trained models are available at https://github.com/ZcsrenlongZ/SelfSVD.

Title: MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

Authors: Yasufumi Kawano, Yoshimitsu Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11194
Pdf URL: https://arxiv.org/pdf/2403.11194
Copy Paste: [[2403.11194]] MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation(https://arxiv.org/abs/2403.11194)
Keywords: diffusion
Abstract: Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements in contrast to other comparable unsupervised segmentation methods, e.g., on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at https://github.com/Valkyrja3607/MaskDiffusion.

Title: MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

Authors: Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, Tanishq Mathew Abraham
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2403.11207
Pdf URL: https://arxiv.org/pdf/2403.11207
Copy Paste: [[2403.11207]] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data(https://arxiv.org/abs/2403.11207)
Keywords: diffusion
Abstract: Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject where each subject requires dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub.

Title: THOR: Text to Human-Object Interaction Diffusion via Relation Intervention

Authors: Qianyang Wu, Ye Shi, Xiaoshui Huang, Jingyi Yu, Lan Xu, Jingya Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11208
Pdf URL: https://arxiv.org/pdf/2403.11208
Copy Paste: [[2403.11208]] THOR: Text to Human-Object Interaction Diffusion via Relation Intervention(https://arxiv.org/abs/2403.11208)
Keywords: diffusion
Abstract: This paper addresses new methodologies to deal with the challenging task of generating dynamic Human-Object Interactions from textual descriptions (Text2HOI). While most existing works assume interactions with limited body parts or static objects, our task involves addressing the variation in human motion, the diversity of object shapes, and the semantic vagueness of object motion simultaneously. To tackle this, we propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR). THOR is a cohesive diffusion model equipped with a relation intervention mechanism. In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. This intervention enhances the spatial-temporal relations between humans and objects, with human-centric interaction representation providing additional guidance for synthesizing consistent motion from text. To achieve more reasonable and realistic results, interaction losses are introduced at different levels of motion granularity. Moreover, we construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset. Both quantitative and qualitative experiments demonstrate the effectiveness of our proposed model.

Title: Understanding Diffusion Models by Feynman's Path Integral

Authors: Yuji Hirono, Akinori Tanaka, Kenji Fukushima
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, hep-th
Abstract URL: https://arxiv.org/abs/2403.11262
Pdf URL: https://arxiv.org/pdf/2403.11262
Copy Paste: [[2403.11262]] Understanding Diffusion Models by Feynman's Path Integral(https://arxiv.org/abs/2403.11262)
Keywords: diffusion, generative
Abstract: Score-based diffusion models have proven effective in image generation and have gained widespread usage; however, the underlying factors contributing to the performance disparity between stochastic and deterministic (i.e., the probability flow ODEs) sampling schemes remain unclear. We introduce a novel formulation of diffusion models using Feynman's path integral, which is a formulation originally developed for quantum physics. We find that this formulation provides comprehensive descriptions of score-based generative models, and demonstrate the derivation of backward stochastic differential equations and loss functions. The formulation accommodates an interpolating parameter connecting stochastic and deterministic sampling schemes, and we identify this parameter as a counterpart of Planck's constant in quantum physics. This analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a well-established technique in quantum physics, for evaluating the negative log-likelihood to assess the performance disparity between stochastic and deterministic sampling schemes.

Title: Stylized Face Sketch Extraction via Generative Prior with Limited Data

Authors: Kwan Yun, Kwanggyoon Seo, Chang Wook Seo, Soyeon Yoon, Seongcheol Kim, Soohyun Ji, Amirsaman Ashtari, Junyong Noh
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.11263
Pdf URL: https://arxiv.org/pdf/2403.11263
Copy Paste: [[2403.11263]] Stylized Face Sketch Extraction via Generative Prior with Limited Data(https://arxiv.org/abs/2403.11263)
Keywords: generative
Abstract: Facial sketches are both a concise way of showing the identity of a person and a means to express artistic intention. While a few techniques have recently emerged that allow sketches to be extracted in different styles, they typically rely on a large amount of data that is difficult to obtain. Here, we propose StyleSketch, a method for extracting high-resolution stylized sketches from a face image. Using the rich semantics of the deep features from a pretrained StyleGAN, we are able to train a sketch generator with 16 pairs of face images and their corresponding sketch images. The sketch generator utilizes part-based losses with two-stage learning for fast convergence during training for high-quality sketch extraction. Through a set of comparisons, we show that StyleSketch outperforms existing state-of-the-art sketch extraction methods and few-shot image adaptation methods for the task of extracting high-resolution abstract face sketches. We further demonstrate the versatility of StyleSketch by extending its use to other domains and explore the possibility of semantic editing. The project page can be found at https://kwanyun.github.io/stylesketch_project.

Title: Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation

Authors: Silvia Corbara, Alejandro Moreo
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11265
Pdf URL: https://arxiv.org/pdf/2403.11265
Copy Paste: [[2403.11265]] Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation(https://arxiv.org/abs/2403.11265)
Keywords: generative
Abstract: Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else. It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author. In this paper, we investigate the potential benefits of augmenting the classifier training set with (negative) synthetic examples. These synthetic examples are generated to imitate the style of the author of interest. We analyze the improvements in classifier prediction that this augmentation brings to bear in the task of AV in an adversarial setting. In particular, we experiment with three different generator architectures (one based on Recurrent Neural Networks, another based on small-scale transformers, and another based on the popular GPT model) and with two training strategies (one inspired by standard Language Models, and another inspired by Wasserstein Generative Adversarial Networks). We evaluate our hypothesis on five datasets (three of which have been specifically collected to represent an adversarial setting) and using two learning algorithms for the AV classifier (Support Vector Machines and Convolutional Neural Networks). This experimentation has yielded negative results, revealing that, although our methodology proves effective in many adversarial settings, its benefits are too sporadic for a pragmatical application.

Title: BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis

Authors: Lutao Jiang, Lin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11273
Pdf URL: https://arxiv.org/pdf/2403.11273
Copy Paste: [[2403.11273]] BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis(https://arxiv.org/abs/2403.11273)
Keywords: generative
Abstract: Text-to-3D synthesis has recently seen intriguing advances by combining the text-to-image models with 3D representation methods, e.g., Gaussian Splatting (GS), via Score Distillation Sampling (SDS). However, existing methods are inefficient because they require per-prompt optimization for each single 3D object. A paradigm shift from per-prompt optimization to one-stage generation for any unseen text prompt is therefore imperative, yet it remains challenging. A hurdle is how to directly generate a set of millions of 3D Gaussians to represent a 3D object. This paper presents BrightDreamer, an end-to-end single-stage approach that can achieve generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation from an anchor shape with predefined positions. For this, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, used as the centers (one attribute) of 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH coefficient), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. The center of each Gaussian enables us to transform the triplane feature into the four attributes. The generated 3D Gaussians can be finally rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. Also, BrightDreamer possesses a strong semantic understanding capability even for complex text prompts. The project code is available at https://vlislab22.github.io/BrightDreamer.

Title: Fast Personalized Text-to-Image Syntheses With Attention Injection

Authors: Yuxuan Zhang, Yiren Song, Jinpeng Yu, Han Pan, Zhongliang Jing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11284
Pdf URL: https://arxiv.org/pdf/2403.11284
Copy Paste: [[2403.11284]] Fast Personalized Text-to-Image Syntheses With Attention Injection(https://arxiv.org/abs/2403.11284)
Keywords: diffusion
Abstract: Currently, personalized image generation methods mostly require considerable time to finetune and often overfit the concept resulting in generated images that are similar to custom concepts but difficult to edit by prompts. We propose an effective and fast approach that could balance the text-image consistency and identity consistency of the generated image and reference image. Our method can generate personalized images without any fine-tuning while maintaining the inherent text-to-image generation ability of diffusion models. Given a prompt and a reference image, we merge the custom concept into generated images by manipulating cross-attention and self-attention layers of the original diffusion model to generate personalized images that match the text description. Comprehensive experiments highlight the superiority of our method.

Title: SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Authors: Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, Zhiqiang Tao
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11299
Pdf URL: https://arxiv.org/pdf/2403.11299
Copy Paste: [[2403.11299]] SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant(https://arxiv.org/abs/2403.11299)
Keywords: self-supervised
Abstract: Recent advancements in the vision-language model have shown notable generalization in vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language models becomes the whole network's bottleneck. To improve cross-modality alignment, existing works usually consider more visual instruction data covering a broader range of vision tasks to fine-tune the model for question-answering, which is costly to obtain. However, the image contains rich contextual information that has been largely under-explored. This paper first attempts to harness this overlooked context within visual instruction data, training the model in a self-supervised manner to learn how to ask high-quality questions. In this way, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data shows a consistent performance improvement compared with traditional visual-instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.

Title: Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts

Authors: Daniel Enström, Viktor Kjellberg, Moa Johansson
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11314
Pdf URL: https://arxiv.org/pdf/2403.11314
Copy Paste: [[2403.11314]] Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts(https://arxiv.org/abs/2403.11314)
Keywords: generative
Abstract: Transformer language models are neural networks used for a wide variety of tasks concerning natural language, including some that also require logical reasoning. However, a transformer model may easily learn spurious patterns in the data, short-circuiting actual reasoning. In this paper we investigate to what extent transformers can be trained to a) approximate reasoning in propositional logic while b) avoiding known reasoning shortcuts via spurious correlations in the training data. To do so, we use a dataset with known spurious correlation between truth and e.g. the number of rules in the problem. We augment the data with proofs, and train two models: a generative transformer, WP-BART, trained on problems and their whole proofs, and a neuro-symbolic model, SIP-BART, trained on individual proof steps and combining the generative transformer model BART with a symbolic proof checker. We find that SIP-BART succeeds in avoiding reasoning shortcuts, while WP-BART does not. For SIP-BART, we then identify a few remaining reasoning errors, not previously described in the literature, arising from using a pre-trained language model. These are qualitatively analysed to create a taxonomy of four different types of additional pitfalls.

Title: Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

Authors: Igor Sterner, Weizhe Lin, Jinghong Chen, Bill Byrne
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.11317
Pdf URL: https://arxiv.org/pdf/2403.11317
Copy Paste: [[2403.11317]] Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches(https://arxiv.org/abs/2403.11317)
Keywords: in-context
Abstract: Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches. But they overlook an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the in-context examples are selected determines which is better.

Title: GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

Authors: Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11324
Pdf URL: https://arxiv.org/pdf/2403.11324
Copy Paste: [[2403.11324]] GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering(https://arxiv.org/abs/2403.11324)
Keywords: generative
Abstract: During the Gaussian Splatting optimization process, the scene's geometry can gradually deteriorate if its structure is not deliberately preserved, especially in non-textured regions such as walls, ceilings, and furniture surfaces. This degradation significantly affects the rendering quality of novel views that deviate significantly from the viewpoints in the training data. To mitigate this issue, we propose a novel approach called GeoGaussian. Based on the smoothly connected areas observed from point clouds, this method introduces a novel pipeline to initialize thin Gaussians aligned with the surfaces, where the characteristic can be transferred to new generations through a carefully designed densification strategy. Finally, the pipeline ensures that the scene's geometry and texture are maintained through constrained optimization processes with explicit geometry constraints. Benefiting from the proposed architecture, the generative ability of 3D Gaussians is enhanced, especially in structured regions. Our proposed pipeline achieves state-of-the-art performance in novel view synthesis and geometric reconstruction, as evaluated qualitatively and quantitatively on public datasets.

Title: Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction

Authors: Xue Bai, Tasmiah Haque, Sumit Mohan, Yuliang Cai, Byungheon Jeong, Adam Halasz, Srinjoy Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11337
Pdf URL: https://arxiv.org/pdf/2403.11337
Copy Paste: [[2403.11337]] Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction(https://arxiv.org/abs/2403.11337)
Keywords: self-supervised
Abstract: We propose a deep learning based novel prediction framework for enhanced bandwidth reduction in motion transfer enabled video applications such as video conferencing, virtual reality gaming and privacy preservation for patient health monitoring. To model complex motion, we use the First Order Motion Model (FOMM) that represents dynamic objects using learned keypoints along with their local affine transformations. Keypoints are extracted by a self-supervised keypoint detector and organized in a time series corresponding to the video frames. Prediction of keypoints, to enable transmission using lower frames per second on the source device, is performed using a Variational Recurrent Neural Network (VRNN). The predicted keypoints are then synthesized to video frames using an optical flow estimator and a generator network. The efficacy of leveraging keypoint based representations in conjunction with VRNN based prediction for both video animation and reconstruction is demonstrated on three diverse datasets. For real-time applications, our results show the effectiveness of our proposed architecture by enabling up to 2x additional bandwidth reduction over existing keypoint based video motion transfer frameworks without significantly compromising video quality.

Title: DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks

Authors: Theresa Huber, Simon Schaefer, Stefan Leutenegger
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11370
Pdf URL: https://arxiv.org/pdf/2403.11370
Copy Paste: [[2403.11370]] DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks(https://arxiv.org/abs/2403.11370)
Keywords: self-supervised
Abstract: The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. We employ a similar scheme of attentional aggregation over graph edges to enhance keypoint representations as state-of-the-art feature-matching networks but augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme to extract pseudo labels for image pairs in dynamic environments from exclusively unprocessed visual-inertial data. A series of experiments show the superior performance of our network as it excludes keypoints on moving objects compared to state-of-the-art feature matching networks while still achieving similar results regarding conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.

Title: Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction

Authors: Hongxiao Wang, Yang Yang, Zhuo Zhao, Pengfei Gu, Nishchal Sapkota, Danny Z. Chen
Subjects: cs.CV, cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2403.11375
Pdf URL: https://arxiv.org/pdf/2403.11375
Copy Paste: [[2403.11375]] Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction(https://arxiv.org/abs/2403.11375)
Keywords: foundation model
Abstract: For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic data (e.g., bulk RNA-seq) for quantifying gene expressions. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, ensuring that both modalities are sufficiently trained. Evaluated on two TCGA (The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.
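For readers unfamiliar with the Cox partial likelihood loss named in this abstract, the standard textbook form can be sketched as below. This is only the generic loss without special handling of tied event times. It does not reproduce the paper's gradient modulation mechanism, and the function name and argument layout are assumptions made for the example.

```python
import numpy as np


def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood (no special handling of ties).

    risk  : predicted log-risk score per patient
    time  : observed survival or censoring time per patient
    event : 1 if the event was observed, 0 if the patient was censored
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)

    # at_risk[i, j] is True when patient j is still at risk at time[i].
    at_risk = time[None, :] >= time[:, None]
    # Log of the summed exp(risk) over each patient's risk set.
    log_risk_set = np.log((np.exp(risk)[None, :] * at_risk).sum(axis=1))
    # Only patients with an observed event contribute a term.
    return -np.sum(event * (risk - log_risk_set))


# Scores that rank earlier events as higher risk give a lower loss.
good = cox_neg_partial_log_likelihood([2.0, 1.0, 0.0], [1, 2, 3], [1, 1, 1])
bad = cox_neg_partial_log_likelihood([0.0, 1.0, 2.0], [1, 2, 3], [1, 1, 1])
print(good < bad)  # True
```

Because the loss depends only on the ordering of risk scores within each risk set, it can be driven down by whichever input modality is easier to fit, which is the kind of imbalance the abstract describes.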

Title: Investigating the Benefits of Projection Head for Representation Learning

Authors: Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi, Baharan Mirzasoleiman
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.11391
Pdf URL: https://arxiv.org/pdf/2403.11391
Copy Paste: [[2403.11391]] Investigating the Benefits of Projection Head for Representation Learning(https://arxiv.org/abs/2403.11391)
Keywords: self-supervised
Abstract: An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.

Title: Automated data processing and feature engineering for deep learning and big data applications: a survey

Authors: Alhassan Mumuni, Fuseini Mumuni
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2403.11395
Pdf URL: https://arxiv.org/pdf/2403.11395
Copy Paste: [[2403.11395]] Automated data processing and feature engineering for deep learning and big data applications: a survey(https://arxiv.org/abs/2403.11395)
Keywords: generative
Abstract: The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before it can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming it into useful features for Big Data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing--e.g., data cleaning, labeling, missing data imputation, and categorical data encoding--as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering--specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.

Title: DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Authors: Jeongsol Kim, Geon Yeong Park, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11415
Pdf URL: https://arxiv.org/pdf/2403.11415
Copy Paste: [[2403.11415]] DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation(https://arxiv.org/abs/2403.11415)
Keywords: diffusion
Abstract: Reverse sampling and score-distillation have emerged as main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode-collapsing. To address these limitations and leverage the strengths of both approaches, here we introduce a novel framework called DreamSampler, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score-distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction, and other tasks, we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications.

Title: VmambaIR: Visual State Space Model for Image Restoration

Authors: Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, Wenming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11423
Pdf URL: https://arxiv.org/pdf/2403.11423
Copy Paste: [[2403.11423]] VmambaIR: Visual State Space Model for Image Restoration(https://arxiv.org/abs/2403.11423)
Keywords: diffusion, generative
Abstract: Image restoration is a critical task in low-level computer vision, aiming to restore high-quality images from degraded inputs. Various models, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models (DMs), have been employed to address this problem with significant impact. However, CNNs have limitations in capturing long-range dependencies. DMs require large prior models and computationally intensive denoising steps. Transformers have powerful modeling capabilities but face challenges due to quadratic complexity with input image size. To address these challenges, we propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks. We utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS) blocks, consisting of an OSS module and an Efficient Feed-Forward Network (EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional modeling limitation of SSMs by efficiently modeling image information flows in all six directions. Furthermore, we conducted a comprehensive evaluation of our VmambaIR across multiple image restoration tasks, including image deraining, single image super-resolution, and real-world image super-resolution. Extensive experimental results demonstrate that our proposed VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters. Our research highlights the potential of state space models as promising alternatives to the transformer and CNN architectures in serving as foundational frameworks for next-generation low-level visual tasks.

Title: BAGS: Building Animatable Gaussian Splatting from a Monocular Video with Diffusion Priors

Authors: Tingyang Zhang, Qingzhe Gao, Weiyu Li, Libin Liu, Baoquan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11427
Pdf URL: https://arxiv.org/pdf/2403.11427
Copy Paste: [[2403.11427]] BAGS: Building Animatable Gaussian Splatting from a Monocular Video with Diffusion Priors(https://arxiv.org/abs/2403.11427)
Keywords: diffusion
Abstract: Animatable 3D reconstruction has significant applications across various fields, primarily relying on artists' handcraft creation. Recently, some studies have successfully constructed animatable 3D models from monocular videos. However, these approaches require sufficient view coverage of the object within the input video and typically necessitate significant time and computational costs for training and rendering. This limitation restricts the practical applications. In this work, we propose a method to build animatable 3D Gaussian Splatting from monocular video with diffusion priors. The 3D Gaussian representations significantly accelerate the training and rendering process, and the diffusion priors allow the method to learn 3D models with limited viewpoints. We also present the rigid regularization to enhance the utilization of the priors. We perform an extensive evaluation across various real-world videos, demonstrating its superior performance compared to the current state-of-the-art methods.

Title: StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation

Authors: Jinpeng Li, Zekai Zhang, Quan Tu, Xin Cheng, Dongyan Zhao, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11439
Pdf URL: https://arxiv.org/pdf/2403.11439
Copy Paste: [[2403.11439]] StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation(https://arxiv.org/abs/2403.11439)
Keywords: generative
Abstract: Large Language Models (LLMs) demonstrate superior performance in generative scenarios and have attracted widespread attention. Among them, stylized dialogue generation is essential in the context of LLMs for building intelligent and engaging dialogue agents. However, the ability of LLMs is data-driven and limited by data bias, leading to poor performance on specific tasks. In particular, stylized dialogue generation suffers from a severe lack of supervised data. Furthermore, although many prompt-based methods have been proposed to accomplish specific tasks, their performance in complex real-world scenarios involving a wide variety of dialog styles requires further enhancement. In this work, we first introduce a stylized dialogue dataset StyleEval with 38 styles by leveraging the generative power of LLMs comprehensively, which has been carefully constructed with rigorous human-led quality control. Based on this, we propose the stylized dialogue framework StyleChat via a recitation-augmented memory strategy and a multi-task style learning strategy to promote generalization ability. To evaluate the effectiveness of our approach, we created a test benchmark that included both a generation task and a choice task to comprehensively evaluate trained models and assess whether styles and preferences are remembered and understood. Experimental results show that our proposed framework StyleChat outperforms all the baselines and helps to break the style boundary of LLMs.

Title: CasSR: Activating Image Power for Real-World Image Super-Resolution

Authors: Haolan Chen, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11451
Pdf URL: https://arxiv.org/pdf/2403.11451
Copy Paste: [[2403.11451]] CasSR: Activating Image Power for Real-World Image Super-Resolution(https://arxiv.org/abs/2403.11451)
Keywords: diffusion
Abstract: The objective of image super-resolution is to generate clean and high-resolution images from degraded versions. Recent advancements in diffusion modeling have led to the emergence of various image super-resolution techniques that leverage pretrained text-to-image (T2I) models. Nevertheless, due to the prevalent severe degradation in low-resolution images and the inherent characteristics of diffusion models, achieving high-fidelity image restoration remains challenging. Existing methods often exhibit issues including semantic loss, artifacts, and the introduction of spurious content not present in the original image. To tackle this challenge, we propose Cascaded diffusion for Super-Resolution, CasSR, a novel method designed to produce highly detailed and realistic images. In particular, we develop a cascaded controllable diffusion model that aims to optimize the extraction of information from low-resolution images. This model generates a preliminary reference image to facilitate initial information extraction and degradation mitigation. Furthermore, we propose a multi-attention mechanism to enhance the T2I model's capability in maximizing the restoration of the original image content. Through a comprehensive blend of qualitative and quantitative analyses, we substantiate the efficacy and superiority of our approach.

Title: Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Authors: Siyu Xu, Yunke Wang, Daochang Liu, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11468
Pdf URL: https://arxiv.org/pdf/2403.11468
Copy Paste: [[2403.11468]] Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V(https://arxiv.org/abs/2403.11468)
Keywords: generative
Abstract: Recent advancements in generative AI have suggested that, given a visual prompt, GPT-4V can demonstrate significant proficiency in image recognition tasks. Despite its impressive capabilities, the financial cost associated with GPT-4V's inference presents a substantial barrier to its wide use. To address this challenge, our work introduces Collage Prompting, a budget-friendly prompting approach that concatenates multiple images into a single visual input. With a collage prompt, GPT-4V is able to perform image recognition on several images simultaneously. Based on the observation that the accuracy of GPT-4V's image recognition varies significantly with the order of images within the collage prompt, our method further learns to optimize the arrangement of images for maximum recognition accuracy. A graph predictor is trained to indicate the accuracy of each collage prompt, then we propose an optimization method to navigate the search space of possible image arrangements. Experimental results across various datasets demonstrate that the cost-efficiency score of the collage prompt is much larger than that of the standard prompt. Additionally, a collage prompt with learned arrangement achieves clearly better accuracy than a collage prompt with random arrangement in GPT-4V's visual recognition.
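Editor's sketch: the core idea of concatenating several images into one visual input can be illustrated as below. This is a minimal sketch, not the authors' implementation: the names `make_collage` and `cost_per_image` are hypothetical, images are plain nested lists of pixel values, and a flat cost per query is assumed purely for illustration (the abstract does not specify a pricing model). The learned arrangement step is not shown.

```python
from typing import List, Sequence

Image = List[List[int]]  # a grayscale image stored as rows of pixel values


def make_collage(images: Sequence[Image], cols: int) -> Image:
    """Tile equally sized images row-major into a grid with `cols` columns.

    Empty grid cells (when the image count is not a multiple of `cols`)
    are filled with zero-valued pixels.
    """
    if not images:
        raise ValueError("need at least one image")
    if cols < 1:
        raise ValueError("cols must be positive")
    h, w = len(images[0]), len(images[0][0])
    if any(len(im) != h or any(len(r) != w for r in im) for im in images):
        raise ValueError("all images must share the same size")

    rows = -(-len(images) // cols)  # ceiling division
    blank = [[0] * w for _ in range(h)]
    cells = list(images) + [blank] * (rows * cols - len(images))

    collage: Image = []
    for r in range(rows):
        row_cells = cells[r * cols:(r + 1) * cols]
        for y in range(h):
            line: List[int] = []
            for cell in row_cells:
                line.extend(cell[y])  # append pixel row y of each cell
            collage.append(line)
    return collage


def cost_per_image(cost_per_query: float, n_images: int) -> float:
    """Assumed flat per-query cost, shared by all images in one collage."""
    return cost_per_query / n_images
```

Because the abstract reports that accuracy depends on image order, the order of `images` passed to `make_collage` is the quantity an arrangement optimizer would search over.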

Title: Generative Motion Stylization within Canonical Motion Space

Authors: Jiaxu Zhang, Xin Chen, Gang Yu, Zhigang Tu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.11469
Pdf URL: https://arxiv.org/pdf/2403.11469
Copy Paste: [[2403.11469]] Generative Motion Stylization within Canonical Motion Space(https://arxiv.org/abs/2403.11469)
Keywords: diffusion, generative
Abstract: Stylized motion breathes life into characters. However, the fixed skeleton structure and style representation hinder existing data-driven motion synthesis methods from generating stylized motion for various characters. In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality style prompts. Our key insight is to embed motion style into a cross-modality latent space and perceive the cross-structure skeleton topologies, allowing for motion stylization within a canonical motion space. Specifically, the large-scale Contrastive-Language-Image-Pre-training (CLIP) model is leveraged to construct the cross-modality latent space, enabling flexible style representation within this space. Additionally, two topology-encoded tokens are learned to capture the canonical and specific skeleton topologies, facilitating cross-structure topology shifting. Subsequently, the topology-shifted stylization diffusion is designed to generate motion content for the specific skeleton and stylize it in the shifted canonical motion space using multi-modality style descriptions. Through an extensive set of examples, we demonstrate the flexibility and generalizability of our pipeline across various characters and style descriptions. Qualitative and quantitative experiments underscore the superiority of our pipeline over state-of-the-art methods, consistently delivering high-quality stylized motion across a broad spectrum of skeletal structures.

Title: Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

Authors: Matthew Zurek, Yudong Chen
Subjects: cs.LG, cs.IT, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11477
Pdf URL: https://arxiv.org/pdf/2403.11477
Copy Paste: [[2403.11477]] Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs(https://arxiv.org/abs/2403.11477)
Keywords: generative
Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue a new transient time parameter $B$ is necessary, establish an $\tilde{O}(SA\frac{B+H}{\epsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{O}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma\geq 1-1/H$, and $\tilde{O}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both these results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
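Editor's note: a short check, using only the regime condition stated in the abstract (and assuming $H>0$, $0\le\gamma<1$), of why the span-based sample count for weakly communicating MDPs is never larger than the worst-case rate for arbitrary discounted MDPs. Log factors are ignored.

```latex
\gamma \ge 1 - \frac{1}{H}
\;\Longrightarrow\;
1 - \gamma \le \frac{1}{H}
\;\Longrightarrow\;
H \le \frac{1}{1-\gamma}
\;\Longrightarrow\;
SA\,\frac{H}{(1-\gamma)^{2}\epsilon^{2}}
\;\le\;
SA\,\frac{1}{(1-\gamma)^{3}\epsilon^{2}} .
```

The same argument with $B+H$ in place of $H$ applies to the general-MDP regime $\gamma \ge 1-\frac{1}{B+H}$.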

Title: VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Authors: Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11481
Pdf URL: https://arxiv.org/pdf/2403.11481
Copy Paste: [[2403.11481]] VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding(https://arxiv.org/abs/2403.11481)
Keywords: foundation model
Abstract: We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.

Title: SeisFusion: Constrained Diffusion Model with Input Guidance for 3D Seismic Data Interpolation and Reconstruction

Authors: Shuang Wang, Fei Deng, Peifan Jiang, Zishan Gong, Xiaolin Wei, Yuqing Wang
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2403.11482
Pdf URL: https://arxiv.org/pdf/2403.11482
Copy Paste: [[2403.11482]] SeisFusion: Constrained Diffusion Model with Input Guidance for 3D Seismic Data Interpolation and Reconstruction(https://arxiv.org/abs/2403.11482)
Keywords: diffusion
Abstract: Geographical, physical, or economic constraints often result in missing traces within seismic data, making the reconstruction of complete seismic data a crucial step in seismic data processing. Traditional methods for seismic data reconstruction require the selection of multiple empirical parameters and struggle to handle large-scale continuous missing data. With the development of deep learning, various neural networks have demonstrated powerful reconstruction capabilities. However, these convolutional neural networks represent a point-to-point reconstruction approach that may not cover the entire distribution of the dataset. Consequently, when dealing with seismic data featuring complex missing patterns, such networks may experience varying degrees of performance degradation. In response to this challenge, we propose a novel diffusion model reconstruction framework tailored for 3D seismic data. To constrain the results generated by the diffusion model, we introduce conditional supervision constraints into the diffusion model, constraining the generated data of the diffusion model based on the input data to be reconstructed. We introduce a 3D neural network architecture into the diffusion model, successfully extending the 2D diffusion model to 3D space. Additionally, we refine the model's generation process by incorporating missing data into the generation process, resulting in reconstructions with higher consistency. Through ablation studies determining optimal parameter values, our method exhibits superior reconstruction accuracy when applied to both field datasets and synthetic datasets, effectively addressing a wide range of complex missing patterns. Our implementation is available at https://github.com/WAL-l/SeisFusion.

Title: CCC++: Optimized Color Classified Colorization with Segment Anything Model (SAM) Empowered Object Selective Color Harmonization

Authors: Mrityunjoy Gain, Avi Deb Raha, Rameswar Debnath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11494
Pdf URL: https://arxiv.org/pdf/2403.11494
Copy Paste: [[2403.11494]] CCC++: Optimized Color Classified Colorization with Segment Anything Model (SAM) Empowered Object Selective Color Harmonization(https://arxiv.org/abs/2403.11494)
Keywords: generative
Abstract: In this paper, we formulate the colorization problem into a multinomial classification problem and then apply a weighted function to classes. We propose a set of formulas to transform color values into color classes and vice versa. To optimize the classes, we experiment with different bin sizes for color class transformation. Observing class appearance, standard deviation, and model parameters on various extremely large-scale real-time images in practice, we propose 532 color classes for our classification task. During training, we propose a class-weighted function based on true class appearance in each batch to ensure proper saturation of individual objects. We adjust the weights of the major classes, which are more frequently observed, by lowering them, while escalating the weights of the minor classes, which are less commonly observed. In our class re-weight formula, we propose a hyper-parameter for finding the optimal trade-off between the major and minor classes. As we apply regularization to enhance the stability of the minor class, occasional minor noise may appear at the object's edges. We propose a novel object-selective color harmonization method empowered by the Segment Anything Model (SAM) to refine and enhance these edges. We propose two new color image evaluation metrics, the Color Class Activation Ratio (CCAR), and the True Activation Ratio (TAR), to quantify the richness of color components. We compare our proposed model with state-of-the-art models using six different datasets: Place, ADE, Celeba, COCO, Oxford 102 Flower, and ImageNet, in qualitative and quantitative approaches. The experimental results show that our proposed model outstrips other models in visualization, CNR and in our proposed CCAR and TAR measurement criteria while maintaining satisfactory performance in regression (MSE, PSNR), similarity (SSIM, LPIPS, UIUI), and generative criteria (FID).

Title: Do CLIPs Always Generalize Better than ImageNet Models?

Authors: Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, Tong Zhang
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11497
Pdf URL: https://arxiv.org/pdf/2403.11497
Copy Paste: [[2403.11497]] Do CLIPs Always Generalize Better than ImageNet Models?(https://arxiv.org/abs/2403.11497)
Keywords: foundation model
Abstract: Large vision language models, such as CLIPs, have revolutionized modern machine learning. CLIPs have demonstrated great generalizability under distribution shifts, supported by an increasing body of literature. However, the evaluation datasets for CLIPs are variations primarily designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., pre-trained on LAION, are robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group: comprising animals on common backgrounds, and b) the counter group: including animals on unusual backgrounds. The performance drops from the common to counter groups quantify the reliance of models on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and one needs to be cautious about test setups when evaluating foundation models pre-trained on a significantly different scale and distribution.
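Editor's sketch: the common-to-counter performance drop described above can be computed as a difference of group accuracies. This is a minimal illustration under assumptions: the function names are hypothetical, and the abstract does not state whether the drop is reported as an absolute or relative difference, so an absolute difference is assumed here.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def spurious_drop(common_acc: float, counter_acc: float) -> float:
    """Absolute accuracy drop from the common group to the counter group.

    A larger drop suggests heavier reliance on background cues.
    """
    return common_acc - counter_acc


# Toy, made-up predictions purely for illustration (not results from the paper).
common = accuracy(["bear", "bear", "fox", "fox"], ["bear", "bear", "fox", "fox"])
counter = accuracy(["bear", "seal", "fox", "dog"], ["bear", "bear", "fox", "fox"])
drop = spurious_drop(common, counter)
```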

Title: Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Authors: Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11503
Pdf URL: https://arxiv.org/pdf/2403.11503
Copy Paste: [[2403.11503]] Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors(https://arxiv.org/abs/2403.11503)
Keywords: diffusion
Abstract: We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.

Title: DEE: Dual-stage Explainable Evaluation Method for Text Generation

Authors: Shenyu Zhang, Yu Li, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11509
Pdf URL: https://arxiv.org/pdf/2403.11509
Copy Paste: [[2403.11509]] DEE: Dual-stage Explainable Evaluation Method for Text Generation(https://arxiv.org/abs/2403.11509)
Keywords: generative
Abstract: Automatic methods for evaluating machine-generated texts hold significant importance due to the expanding applications of generative systems. Conventional methods tend to grapple with a lack of explainability, issuing a solitary numerical score to signify the assessment outcome. Recent advancements have sought to mitigate this limitation by incorporating large language models (LLMs) to offer more detailed error analyses, yet their applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions to perform efficient identification of errors in generated texts in the initial stage and subsequently delves into providing comprehensive diagnostic reports in the second stage. DEE is fine-tuned on our elaborately assembled dataset AntEval, which encompasses 15K examples from 4 real-world applications of Alipay that employ generative systems. The dataset concerns newly emerged issues like hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm DEE's superiority over existing evaluation methods, achieving significant improvements in both human correlation and efficiency.

Title: EchoReel: Enhancing Action Generation of Existing Video Diffusion Models

Authors: Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11535
Pdf URL: https://arxiv.org/pdf/2403.11535
Copy Paste: [[2403.11535]] EchoReel: Enhancing Action Generation of Existing Video Diffusion Models(https://arxiv.org/abs/2403.11535)
Keywords: diffusion
Abstract: Recent large-scale video datasets have facilitated the generation of diverse open-domain videos by Video Diffusion Models (VDMs). Nonetheless, the efficacy of VDMs in assimilating complex knowledge from these datasets remains constrained by their inherent scale, leading to suboptimal comprehension and synthesis of numerous actions. In this paper, we introduce EchoReel, a novel approach to augment the capability of VDMs in generating intricate actions by emulating motions from pre-existing videos, which are readily accessible from databases or online repositories. EchoReel seamlessly integrates with existing VDMs, enhancing their ability to produce realistic motions without compromising their fundamental capabilities. Specifically, the Action Prism (AP) is introduced to distill motion information from reference videos, which requires training on only a small dataset. Leveraging the knowledge from pre-trained VDMs, EchoReel incorporates new action features into VDMs through the additional layers, eliminating the need for any further fine-tuning of untrained actions. Extensive experiments demonstrate that EchoReel is not merely replicating the whole content from references, and it significantly improves the generation of realistic actions, even in situations where existing VDMs might directly fail.

Title: Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection

Authors: Liren He, Zhengkai Jiang, Jinlong Peng, Liang Liu, Qiangang Du, Xiaobin Hu, Wenbing Zhu, Mingmin Chi, Yabiao Wang, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11561
Pdf URL: https://arxiv.org/pdf/2403.11561
Copy Paste: [[2403.11561]] Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection(https://arxiv.org/abs/2403.11561)
Keywords: anomaly
Abstract: In the field of multi-class anomaly detection, reconstruction-based methods derived from single-class anomaly detection face the well-known challenge of "learning shortcuts", wherein the model fails to learn the patterns of normal samples as it should, opting instead for shortcuts such as identity mapping or artificial noise elimination. Consequently, the model becomes unable to reconstruct genuine anomalies as normal instances, resulting in a failure of anomaly detection. To counter this issue, we present a novel unified feature reconstruction-based anomaly detection framework termed RLR (Reconstruct features from a Learnable Reference representation). Unlike previous methods, RLR utilizes learnable reference representations to compel the model to learn normal feature patterns explicitly, thereby preventing the model from succumbing to the "learning shortcuts" issue. Additionally, RLR incorporates locality constraints into the learnable reference to facilitate more effective normal pattern capture and utilizes a masked learnable key attention mechanism to enhance robustness. Evaluation of RLR on the 15-category MVTec-AD dataset and the 12-category VisA dataset shows superior performance compared to state-of-the-art methods under the unified setting. The code of RLR will be publicly available.

Title: EffiVED: Efficient Video Editing via Text-instruction Diffusion Models

Authors: Zhenghao Zhang, Zuozhuo Dai, Long Qin, Weizhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11568
Pdf URL: https://arxiv.org/pdf/2403.11568
Copy Paste: [[2403.11568]] EffiVED: Efficient Video Editing via Text-instruction Diffusion Models(https://arxiv.org/abs/2403.11568)
Keywords: diffusion
Abstract: Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality editing videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. The datasets will be made publicly available upon publication.

Title: CRS-Diff: Controllable Generative Remote Sensing Foundation Model

Authors: Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11614
Pdf URL: https://arxiv.org/pdf/2403.11614
Copy Paste: [[2403.11614]] CRS-Diff: Controllable Generative Remote Sensing Foundation Model(https://arxiv.org/abs/2403.11614)
Keywords: diffusion, foundation model, generative
Abstract: The emergence of diffusion models has revolutionized the field of image generation, providing new methods for creating high-quality, high-resolution images across various applications. However, the potential of these models for generating domain-specific images, particularly remote sensing (RS) images, remains largely untapped. RS images, which are notable for their high resolution, extensive coverage, and rich information content, bring new challenges that general diffusion models may not adequately address. This paper proposes CRS-Diff, a pioneering diffusion modeling framework specifically tailored for generating remote sensing imagery, leveraging the inherent advantages of diffusion models while integrating advanced control mechanisms to ensure that the imagery is not only visually clear but also enriched with geographic and temporal information. The model integrates global and local control inputs, enabling precise combinations of generation conditions to refine the generation process. A comprehensive evaluation of CRS-Diff has demonstrated its superior capability to generate RS imagery under both single and multiple conditions compared with previous methods in terms of image quality and diversity.

Title: LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

Authors: Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, Wei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11627
Pdf URL: https://arxiv.org/pdf/2403.11627
Copy Paste: [[2403.11627]] LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models(https://arxiv.org/abs/2403.11627)
Keywords: diffusion
Abstract: Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as the challenging task within this domain. Existing approaches often rely on training a Low-Rank Adaptation (LoRA) fusion matrix of multiple LoRAs to merge various concepts into a single image. However, we find that this straightforward method faces two major challenges: 1) concept confusion, which occurs when the model cannot preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through Concept Injection Constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, Concept Isolation Constraints are introduced, refining the self-attention computation. Furthermore, Latent Re-initialization is proposed to effectively stimulate concept-specific latents within designated regions. Our extensive testing showcases a notable enhancement in LoRA-Composer's performance compared to standard baselines, especially when eliminating image-based conditions like canny edge or pose estimations. Code is released at https://github.com/Young98CN/LoRA_Composer.

Title: Arc2Face: A Foundation Model of Human Faces

Authors: Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11641
Pdf URL: https://arxiv.org/pdf/2403.11641
Copy Paste: [[2403.11641]] Arc2Face: A Foundation Model of Human Faces(https://arxiv.org/abs/2403.11641)
Keywords: diffusion, foundation model
Abstract: This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.

Title: Diffusion-Based Environment-Aware Trajectory Prediction

Authors: Theodor Westny, Björn Olofsson, Erik Frisk
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11643
Pdf URL: https://arxiv.org/pdf/2403.11643
Copy Paste: [[2403.11643]] Diffusion-Based Environment-Aware Trajectory Prediction(https://arxiv.org/abs/2403.11643)
Keywords: diffusion, generative
Abstract: The ability to predict the future trajectories of traffic participants is crucial for the safe and efficient operation of autonomous vehicles. In this paper, a diffusion-based generative model for multi-agent trajectory prediction is proposed. The model is capable of capturing the complex interactions between traffic participants and the environment, accurately learning the multimodal nature of the data. The effectiveness of the approach is assessed on large-scale datasets of real-world traffic scenarios, showing that our model outperforms several well-established methods in terms of prediction accuracy. By the incorporation of differential motion constraints on the model output, we illustrate that our model is capable of generating a diverse set of realistic future trajectories. Through the use of an interaction-aware guidance signal, we further demonstrate that the model can be adapted to predict the behavior of less cooperative agents, emphasizing its practical applicability under uncertain traffic conditions.

Title: Binary Noise for Binary Tasks: Masked Bernoulli Diffusion for Unsupervised Anomaly Detection

Authors: Julia Wolleb, Florentin Bieder, Paul Friedrich, Peter Zhang, Alicia Durrer, Philippe C. Cattin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11667
Pdf URL: https://arxiv.org/pdf/2403.11667
Copy Paste: [[2403.11667]] Binary Noise for Binary Tasks: Masked Bernoulli Diffusion for Unsupervised Anomaly Detection(https://arxiv.org/abs/2403.11667)
Keywords: diffusion, anomaly
Abstract: The high performance of denoising diffusion models for image generation has paved the way for their application in unsupervised medical anomaly detection. As diffusion-based methods require a lot of GPU memory and have long sampling times, we present a novel and fast unsupervised anomaly detection approach based on latent Bernoulli diffusion models. We first apply an autoencoder to compress the input images into a binary latent representation. Next, a diffusion model that follows a Bernoulli noise schedule is applied to this latent space and trained to restore binary latent representations from perturbed ones. The binary nature of this diffusion model allows us to identify entries in the latent space that have a high probability of flipping their binary code during the denoising process, which indicates out-of-distribution data. We propose a masking algorithm based on these probabilities, which improves the anomaly detection scores. We achieve state-of-the-art performance compared to other diffusion-based unsupervised anomaly detection algorithms while significantly reducing sampling time and memory consumption. The code is available at https://github.com/JuliaWolleb/Anomaly_berdiff.
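Editor's sketch: the two ingredients named above, Bernoulli noising of a binary latent and a mask built from per-entry flip probabilities, can be illustrated as follows. This is a hedged sketch, not the authors' code. The noising kernel is the standard single-step Bernoulli form (bit kept with weight 1 - beta, mixed with a fair coin with weight beta). The function names, the flat-list latent, and the fixed-threshold masking rule are assumptions for illustration, since the abstract does not give the exact schedule or masking algorithm.

```python
import random
from typing import List


def noised_prob(bit: int, beta: float) -> float:
    """P(noisy bit = 1) after one Bernoulli noising step with rate `beta`."""
    return bit * (1.0 - beta) + 0.5 * beta


def noise_latent(latent: List[int], beta: float, rng: random.Random) -> List[int]:
    """Sample a perturbed binary latent, entry by entry."""
    return [1 if rng.random() < noised_prob(b, beta) else 0 for b in latent]


def flip_probability(bit: int, p_one: float) -> float:
    """Probability that a denoiser predicting P(bit = 1) = p_one flips `bit`."""
    return 1.0 - p_one if bit == 1 else p_one


def anomaly_mask(latent: List[int], p_ones: List[float],
                 threshold: float = 0.5) -> List[bool]:
    """Mark entries likely to flip during denoising (assumed threshold rule)."""
    return [flip_probability(b, p) > threshold for b, p in zip(latent, p_ones)]


# Toy usage with made-up denoiser outputs (hypothetical values).
latent = [1, 0, 1]
mask = anomaly_mask(latent, p_ones=[0.9, 0.8, 0.2])
```

In the toy call, the second and third entries are flagged because the assumed denoiser output disagrees with the stored bit.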

Title: TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models

Authors: Lisa Weijler, Muhammad Jehanzeb Mirza, Leon Sick, Can Ekkazan, Pedro Hermosilla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11691
Pdf URL: https://arxiv.org/pdf/2403.11691
Copy Paste: [[2403.11691]] TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models(https://arxiv.org/abs/2403.11691)
Keywords: self-supervised, foundation model
Abstract: Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g. DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone for the main task of semantic segmentation using the pointclouds and the task of 2D $\to$ 3D KD by using an off-the-shelf 2D pre-trained foundation model. At test-time, our TTT-KD updates the 3D segmentation backbone for each test sample, by using the self-supervised task of knowledge distillation, before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (OOD) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar and up to 45% (20% on average) when adapting to OOD test samples.

Title: Urban Scene Diffusion through Semantic Occupancy Map

Authors: Junge Zhang, Qihang Zhang, Li Zhang, Ramana Rao Kompella, Gaowen Liu, Bolei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11697
Pdf URL: https://arxiv.org/pdf/2403.11697
Copy Paste: [[2403.11697]] Urban Scene Diffusion through Semantic Occupancy Map(https://arxiv.org/abs/2403.11697)
Keywords: diffusion
Abstract: Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of a semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene to an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given the BEV maps from the held-out set and also generalize to the synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.

Title: PITA: Physics-Informed Trajectory Autoencoder

Authors: Johannes Fischer, Kevin Rösch, Martin Lauer, Christoph Stiller
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11728
Pdf URL: https://arxiv.org/pdf/2403.11728
Copy Paste: [[2403.11728]] PITA: Physics-Informed Trajectory Autoencoder(https://arxiv.org/abs/2403.11728)
Keywords: generative
Abstract: Validating robotic systems in safety-critical applications requires testing in many scenarios including rare edge cases that are unlikely to occur, requiring to complement real-world testing with testing in simulation. Generative models can be used to augment real-world datasets with generated data to produce edge case scenarios by sampling in a learned latent space. Autoencoders can learn said latent representation for a specific domain by learning to reconstruct the input data from a lower-dimensional intermediate representation. However, the resulting trajectories are not necessarily physically plausible, but instead typically contain noise that is not present in the input trajectory. To resolve this issue, we propose the novel Physics-Informed Trajectory Autoencoder (PITA) architecture, which incorporates a physical dynamics model into the loss function of the autoencoder. This results in smooth trajectories that not only reconstruct the input trajectory but also adhere to the physical model. We evaluate PITA on a real-world dataset of vehicle trajectories and compare its performance to a normal autoencoder and a state-of-the-art action-space autoencoder.
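As a rough illustration of adding a dynamics term to an autoencoder loss, the sketch below combines a reconstruction error with a penalty on violations of a simple motion model. Everything specific here is an assumption for illustration: the 1D constant-velocity step `x[k+1] = x[k] + v[k] * dt`, the function names, and the `weight` parameter. The abstract does not say which physical model or weighting PITA uses.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def physics_residual(positions, velocities, dt):
    """Mean squared violation of the assumed step x[k+1] = x[k] + v[k] * dt."""
    residuals = [positions[k + 1] - (positions[k] + velocities[k] * dt)
                 for k in range(len(positions) - 1)]
    return sum(r * r for r in residuals) / len(residuals)

def physics_informed_loss(recon_pos, recon_vel, target_pos, dt, weight=1.0):
    """Reconstruction error plus a weighted physics-consistency penalty."""
    return mse(recon_pos, target_pos) + weight * physics_residual(recon_pos, recon_vel, dt)
```

A reconstruction that matches the input but jitters between steps is penalized by the second term, which is the intuition behind obtaining smoother trajectories.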

Title: S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention

Authors: Pierre Guetschel, Thomas Moreau, Michael Tangermann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11772
Pdf URL: https://arxiv.org/pdf/2403.11772
Copy Paste: [[2403.11772]] S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention(https://arxiv.org/abs/2403.11772)
Keywords: self-supervised
Abstract: Motivated by the challenge of seamless cross-dataset transfer in EEG signal processing, this article presents an exploratory study on the use of Joint Embedding Predictive Architectures (JEPAs). In recent years, self-supervised learning has emerged as a promising approach for transfer learning in various domains. However, its application to EEG signals remains largely unexplored. In this article, we introduce Signal-JEPA for representing EEG recordings which includes a novel domain-specific spatial block masking strategy and three novel architectures for downstream classification. The study is conducted on a 54-subject dataset and the downstream performance of the models is evaluated on three different BCI paradigms: motor imagery, ERP and SSVEP. Our study provides preliminary evidence for the potential of JEPAs in EEG signal encoding. Notably, our results highlight the importance of spatial filtering for accurate downstream classification and reveal an influence of the length of the pre-training examples but not of the mask size on the downstream performance.

Title: Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

Authors: Yi Wu, Ziqiang Li, Heliang Zheng, Chaoyue Wang, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11781
Pdf URL: https://arxiv.org/pdf/2403.11781
Copy Paste: [[2403.11781]] Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm(https://arxiv.org/abs/2403.11781)
Keywords: diffusion
Abstract: Drawing on recent advancements in diffusion models for text-to-image generation, identity-preserved personalization has made significant progress in accurately capturing specific identities with just a single reference image. However, existing methods primarily integrate reference images within the text embedding space, leading to a complex entanglement of image and text information, which poses challenges for preserving both identity fidelity and semantic consistency. To tackle this challenge, we propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. Specifically, we introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information while deactivating the original text cross-attention module of the diffusion model. This ensures that the image stream faithfully represents the identity provided by the reference image while mitigating interference from textual input. Additionally, we introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams. This mechanism not only enhances the fidelity of identity and semantic consistency but also enables convenient control over the styles of the generated images. Extensive experimental results on both raw photo generation and style image generation demonstrate the superior performance of our proposed method.

Title: Is It Really You Who Forgot the Password? When Account Recovery Meets Risk-Based Authentication

Authors: Andre Büttner, Andreas Thue Pedersen, Stephan Wiefling, Nils Gruschka, Luigi Lo Iacono
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11798
Pdf URL: https://arxiv.org/pdf/2403.11798
Copy Paste: [[2403.11798]] Is It Really You Who Forgot the Password? When Account Recovery Meets Risk-Based Authentication(https://arxiv.org/abs/2403.11798)
Keywords: anomaly
Abstract: Risk-based authentication (RBA) is used in online services to protect user accounts from unauthorized takeover. RBA commonly uses contextual features that indicate a suspicious login attempt when the characteristic attributes of the login context deviate from known and thus expected values. Previous research on RBA and anomaly detection in authentication has mainly focused on the login process. However, recent attacks have revealed vulnerabilities in other parts of the authentication process, specifically in the account recovery function. Consequently, to ensure comprehensive authentication security, the use of anomaly detection in the context of account recovery must also be investigated. This paper presents the first study to investigate risk-based account recovery (RBAR) in the wild. We analyzed the adoption of RBAR by five prominent online services (that are known to use RBA). Our findings confirm the use of RBAR at Google, LinkedIn, and Amazon. Furthermore, we provide insights into the different RBAR mechanisms of these services and explore the impact of multi-factor authentication on them. Based on our findings, we create a first maturity model for RBAR challenges. The goal of our work is to help developers, administrators, and policy-makers gain an initial understanding of RBAR and to encourage further research in this direction.

Title: HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation

Authors: Sha Zhang, Jiajun Deng, Lei Bai, Houqiang Li, Wanli Ouyang, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11817
Pdf URL: https://arxiv.org/pdf/2403.11817
Copy Paste: [[2403.11817]] HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation(https://arxiv.org/abs/2403.11817)
Keywords: self-supervised
Abstract: We present a hybrid-view-based knowledge distillation framework, termed HVDistill, to guide the feature learning of a point cloud neural network with a pre-trained image network in an unsupervised manner. By exploiting the geometric relationship between RGB cameras and LiDAR sensors, the correspondence between the two modalities based on both image-plane view and bird-eye view can be established, which facilitates representation learning. Specifically, the image-plane correspondences can be simply obtained by projecting the point clouds, while the bird-eye-view correspondences can be achieved by lifting pixels to the 3D space with the predicted depths under the supervision of projected point clouds. The image teacher networks provide rich semantics from the image-plane view and meanwhile acquire geometric information from the bird-eye view. Indeed, image features from the two views naturally complement each other and together can ameliorate the learned feature representation of the point cloud student networks. Moreover, with a self-supervised pre-trained 2D network, HVDistill requires neither 2D nor 3D annotations. We pre-train our model on nuScenes dataset and transfer it to several downstream tasks on nuScenes, SemanticKITTI, and KITTI datasets for evaluation. Extensive experimental results show that our method achieves consistent improvements over the baseline trained from scratch and significantly outperforms the existing schemes. Codes are available at git@github.com:zhangsha1024/HVDistill.git.

Title: Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics

Authors: Sebastian Hartwig, Dominik Engel, Leon Sick, Hannah Kniesel, Tristan Payer, Poonam, Timo Ropinski
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2403.11821
Pdf URL: https://arxiv.org/pdf/2403.11821
Copy Paste: [[2403.11821]] Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics(https://arxiv.org/abs/2403.11821)
Keywords: foundation model
Abstract: Recent advances in text-to-image synthesis have been enabled by exploiting a combination of language and vision through foundation models. These models are pre-trained on tremendous amounts of text-image pairs sourced from the World Wide Web or other large-scale databases. As the demand for high-quality image generation shifts towards ensuring content alignment between text and image, novel evaluation metrics have been developed with the aim of mimicking human judgments. Thus, researchers have started to collect datasets with increasingly complex annotations to study the compositionality of vision-language models and their incorporation as a quality measure of compositional alignment between text and image contents. In this work, we provide a comprehensive overview of existing text-to-image evaluation metrics and propose a new taxonomy for categorizing these metrics. We also review frequently adopted text-image benchmark datasets before discussing techniques to optimize text-to-image synthesis models towards quality and human preferences. Ultimately, we derive guidelines for improving text-to-image evaluation and discuss the open challenges and current limitations.

Title: Towards Understanding the Relationship between In-context Learning and Compositional Generalization

Authors: Sungjun Han, Sebastian Padó
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11834
Pdf URL: https://arxiv.org/pdf/2403.11834
Copy Paste: [[2403.11834]] Towards Understanding the Relationship between In-context Learning and Compositional Generalization(https://arxiv.org/abs/2403.11834)
Keywords: in-context
Abstract: According to the principle of compositional generalization, the meaning of a complex expression can be understood as a function of the meaning of its parts and of how they are combined. This principle is crucial for human language processing and also, arguably, for NLP models in the face of out-of-distribution data. However, many neural network models, including Transformers, have been shown to struggle with compositional generalization. In this paper, we hypothesize that forcing models to in-context learn can provide an inductive bias to promote compositional generalization. To test this hypothesis, we train a causal Transformer in a setting that renders ordinary learning very difficult: we present it with different orderings of the training instance and shuffle instance labels. This corresponds to training the model on all possible few-shot learning problems attainable from the dataset. The model can solve the task, however, by utilizing earlier examples to generalize to later ones (i.e. in-context learning). In evaluations on the datasets, SCAN, COGS, and GeoQuery, models trained in this manner indeed show improved compositional generalization. This indicates the usefulness of in-context learning problems as an inductive bias for generalization.

Title: GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture

Authors: Shanglong Yang, Zhipeng Yuan, Shunbao Li, Ruoling Peng, Kang Liu, Po Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11858
Pdf URL: https://arxiv.org/pdf/2403.11858
Copy Paste: [[2403.11858]] GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture(https://arxiv.org/abs/2403.11858)
Keywords: generative
Abstract: In the rapidly evolving field of artificial intelligence (AI), the application of large language models (LLMs) in agriculture, particularly in pest management, remains nascent. We aimed to prove the feasibility by evaluating the content of the pest management advice generated by LLMs, including the Generative Pre-trained Transformer (GPT) series from OpenAI and the FLAN series from Google. Considering the context-specific properties of agricultural advice, automatically measuring or quantifying the quality of text generated by LLMs becomes a significant challenge. We proposed an innovative approach, using GPT-4 as an evaluator, to score the generated content on Coherence, Logical Consistency, Fluency, Relevance, Comprehensibility, and Exhaustiveness. Additionally, we integrated an expert system based on crop threshold data as a baseline to obtain scores for Factual Accuracy on whether pests found in crop fields should take management action. Each model's score was weighted by percentage to obtain a final score. The results showed that GPT-3.5 and GPT-4 outperform the FLAN models in most evaluation categories. Furthermore, the use of instruction-based prompting containing domain-specific knowledge proved the feasibility of LLMs as an effective tool in agriculture, with an accuracy rate of 72%, demonstrating LLMs' effectiveness in providing pest management suggestions.

Title: IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images

Authors: Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, Liang Lin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11870
Pdf URL: https://arxiv.org/pdf/2403.11870
Copy Paste: [[2403.11870]] IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images(https://arxiv.org/abs/2403.11870)
Keywords: diffusion, generative
Abstract: Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) exert dominance in the cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) proficiency in image generation and reconstruction due to their formidable generative capabilities. Inspired by the rapid development of diffusion models, we first present an iterative diffusion process for cloud removal (IDF-CR), which exhibits strong generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is divided into two-stage models that address pixel space and latent space. The two-stage model facilitates a strategic transition from preliminary cloud reduction to meticulous detail refinement. In the pixel space stage, Pixel-CR initiates the processing of cloudy images, yielding a suboptimal cloud removal prior to providing the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms low-quality cloud removal into high-quality clean output. We refine the Stable Diffusion by implementing ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for the diffusion model to optimize the distribution of the predicted noise, thereby enhancing advanced detail recovery. Our model outperforms other SOTA methods, including image reconstruction and optical remote-sensing cloud removal methods, on the optical remote-sensing datasets.

Title: CO3: Low-resource Contrastive Co-training for Generative Conversational Query Rewrite

Authors: Yifei Yuan, Chen Shi, Runze Wang, Liyi Chen, Renjun Hu, Zengming Zhang, Feijun Jiang, Wai Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11873
Pdf URL: https://arxiv.org/pdf/2403.11873
Copy Paste: [[2403.11873]] CO3: Low-resource Contrastive Co-training for Generative Conversational Query Rewrite(https://arxiv.org/abs/2403.11873)
Keywords: generative
Abstract: Generative query rewrite generates reconstructed query rewrites using the conversation history while relying heavily on gold rewrite pairs that are expensive to obtain. Recently, few-shot learning is gaining increasing popularity for this task, whereas these methods are sensitive to the inherent noise due to limited data size. Besides, both attempts face performance degradation when there exists language style shift between training and testing cases. To this end, we study low-resource generative conversational query rewrite that is robust to both noise and language style shift. The core idea is to utilize massive unlabeled data to make further improvements via a contrastive co-training paradigm. Specifically, we co-train two dual models (namely Rewriter and Simplifier) such that each of them provides extra guidance through pseudo-labeling for enhancing the other in an iterative manner. We also leverage contrastive learning with data augmentation, which enables our model to pay more attention to the truly valuable information than the noise. Extensive experiments demonstrate the superiority of our model under both few-shot and zero-shot scenarios. We also verify the better generalization ability of our model when encountering language style shift.

Title: InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting

Authors: Jiaxiang Tang, Ruijie Lu, Xiaokang Chen, Xiang Wen, Gang Zeng, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11878
Pdf URL: https://arxiv.org/pdf/2403.11878
Copy Paste: [[2403.11878]] InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting(https://arxiv.org/abs/2403.11878)
Keywords: diffusion
Abstract: Text-to-texture synthesis has become a new frontier in 3D content creation thanks to the recent advances in text-to-image models. Existing methods primarily adopt a combination of pretrained depth-aware diffusion and inpainting models, yet they exhibit shortcomings such as 3D inconsistency and limited controllability. To address these challenges, we introduce InteX, a novel framework for interactive text-to-texture synthesis. 1) InteX includes a user-friendly interface that facilitates interaction and control throughout the synthesis process, enabling region-specific repainting and precise texture editing. 2) Additionally, we develop a unified depth-aware inpainting model that integrates depth information with inpainting cues, effectively mitigating 3D inconsistencies and improving generation speed. Through extensive experiments, our framework has proven to be both practical and effective in text-to-texture synthesis, paving the way for high-quality 3D content creation.

Title: ReGenNet: Towards Human Action-Reaction Synthesis

Authors: Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11882
Pdf URL: https://arxiv.org/pdf/2403.11882
Copy Paste: [[2403.11882]] ReGenNet: Towards Human Action-Reaction Synthesis(https://arxiv.org/abs/2403.11882)
Keywords: diffusion, generative
Abstract: Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes.

Title: SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

Authors: Xiangyu Chen, Jing Liu, Ye Wang, Pu (Perry) Wang, Matthew Brand, Guanghui Wang, Toshiaki Koike-Akino
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11887
Pdf URL: https://arxiv.org/pdf/2403.11887
Copy Paste: [[2403.11887]] SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules(https://arxiv.org/abs/2403.11887)
Keywords: diffusion
Abstract: Low-rank adaptation (LoRA) and its variants are widely employed in fine-tuning large models, including large language models for natural language processing and diffusion models for computer vision. This paper proposes a generalized framework called SuperLoRA that unifies and extends different LoRA variants, which can be realized under different hyper-parameter settings. Introducing grouping, folding, shuffling, projecting, and tensor factoring, SuperLoRA offers high flexibility compared with other LoRA variants and demonstrates superior performance for transfer learning tasks especially in the extremely few-parameter regimes.
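For readers unfamiliar with the baseline that SuperLoRA generalizes, plain LoRA keeps a weight matrix W frozen and learns a low-rank update, so the adapted weight is W + scale * B A. The sketch below shows only that baseline in pure Python; the function names and the `scale` parameter are illustrative, and none of SuperLoRA's grouping, folding, shuffling, projecting, or tensor factoring is modeled.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_update(w, b, a, scale=1.0):
    """Return W + scale * (B @ A), where B is d_out x r and A is r x d_in."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B and A, versus d_out * d_in for full fine-tuning."""
    return rank * (d_out + d_in)
```

For a 1024 x 1024 layer at rank 8, the update has 16,384 trainable parameters against 1,048,576 for the full matrix, which is the kind of budget the few-parameter regime refers to.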

Title: CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification

Authors: Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11904
Pdf URL: https://arxiv.org/pdf/2403.11904
Copy Paste: [[2403.11904]] CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification(https://arxiv.org/abs/2403.11904)
Keywords: in-context
Abstract: Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a tf-idf representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting.
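One way to picture an LLM-in-the-loop scheme built on conformal prediction is sketched below. It uses standard split conformal prediction with the score 1 - p(class); the routing rule (answer directly when the prediction set has one label, otherwise hand the narrowed candidate list to an LLM) and the `ask_llm` callback are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of calibration scores (1 - p of the true class)."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(calibration_scores)[k - 1]

def prediction_set(class_probs, threshold):
    """Labels whose nonconformity score 1 - p is within the threshold."""
    return sorted(label for label, p in class_probs.items() if 1.0 - p <= threshold)

def classify(class_probs, threshold, ask_llm):
    """Answer directly when the set is a single label, else defer to ask_llm."""
    candidates = prediction_set(class_probs, threshold)
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        candidates = sorted(class_probs)  # fall back to all labels
    return ask_llm(candidates)  # hypothetical callback wrapping a prompt
```

Calling the LLM only on ambiguous inputs, and with a shortened label list, is one plausible reading of how such a framework could cut energy use relative to prompting on every example.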

Title: LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Authors: Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11929
Pdf URL: https://arxiv.org/pdf/2403.11929
Copy Paste: [[2403.11929]] LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model(https://arxiv.org/abs/2403.11929)
Keywords: diffusion, generative
Abstract: Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.

Title: Subjective-Aligned Dateset and Metric for Text-to-Video Quality Assessment

Authors: Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11956
Pdf URL: https://arxiv.org/pdf/2403.11956
Copy Paste: [[2403.11956]] Subjective-Aligned Dateset and Metric for Text-to-Video Quality Assessment(https://arxiv.org/abs/2403.11956)
Keywords: generative
Abstract: With the rapid development of generative models, Artificial Intelligence-Generated Contents (AIGC) have exponentially increased in daily lives. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating high perceptual quality videos, there is still a lack of a method to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, then it leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-aligned predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA.

Title: Transfer Learning Beyond Bounded Density Ratios

Authors: Alkis Kalavasis, Ilias Zadik, Manolis Zampetakis
Subjects: cs.LG, cs.DS, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11963
Pdf URL: https://arxiv.org/pdf/2403.11963
Copy Paste: [[2403.11963]] Transfer Learning Beyond Bounded Density Ratios(https://arxiv.org/abs/2403.11963)
Keywords: in-context
Abstract: We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$. A standard change of measure argument implies that transfer learning happens when the density ratio $dQ/dP$ is bounded. Yet, prior thought-provoking works by Kpotufe and Martinet (COLT, 2018) and Hanneke and Kpotufe (NeurIPS, 2019) demonstrate cases where the ratio $dQ/dP$ is unbounded, but transfer learning is possible. In this work, we focus on transfer learning over the class of low-degree polynomial estimators. Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions, going well beyond the classical assumption that $dQ/dP$ is bounded. For instance, it always applies if $Q$ is a log-concave measure and the inverse ratio $dP/dQ$ is bounded. To demonstrate the applicability of our inequality, we obtain new results in the settings of: (1) the classical truncated regression setting, where $dQ/dP$ equals infinity, and (2) the more recent out-of-distribution generalization setting for in-context learning linear functions with transformers. We also provide a discrete analogue of our transfer inequality on the Boolean Hypercube $\{-1,1\}^n$, and study its connections with the recent problem of Generalization on the Unseen of Abbe, Bengio, Lotfi and Rizk (ICML, 2023). Our main conceptual contribution is that the maximum influence of the error of the estimator $\widehat{f}-f^*$ under $Q$, $\mathrm{I}_{\max}(\widehat{f}-f^*)$, acts as a sufficient condition for transferability; when $\mathrm{I}_{\max}(\widehat{f}-f^*)$ is appropriately bounded, transfer is possible over the Boolean domain.
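The "standard change of measure argument" mentioned in this abstract can be written out in one line. The notation $\ell$ for a nonnegative loss is an assumption introduced here, and $Q$ is assumed absolutely continuous with respect to $P$.

```latex
\mathbb{E}_{Q}[\ell]
  = \mathbb{E}_{P}\!\left[\frac{dQ}{dP}\,\ell\right]
  \le \left\lVert \frac{dQ}{dP} \right\rVert_{\infty} \, \mathbb{E}_{P}[\ell]
```

So a small error under the source $P$ implies a small error under the target $Q$ whenever the density ratio is bounded; the bound becomes vacuous when the ratio is unbounded, which is the regime the abstract addresses.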

Title: Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory

Authors: Hengyu Fu, Zhuoran Yang, Mengdi Wang, Minshuo Chen
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11968
Pdf URL: https://arxiv.org/pdf/2403.11968
Copy Paste: [[2403.11968]] Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory(https://arxiv.org/abs/2403.11968)
Keywords: diffusion
Abstract: Conditional diffusion models serve as the foundation of modern image synthesis and find extensive application in fields like computational biology and reinforcement learning. In these applications, conditional diffusion models incorporate various conditional information, such as prompt input, to guide the sample generation towards desired properties. Despite the empirical success, the theory of conditional diffusion models is largely missing. This paper bridges this gap by presenting a sharp statistical theory of distribution estimation using conditional diffusion models. Our analysis yields a sample complexity bound that adapts to the smoothness of the data distribution and matches the minimax lower bound. The key to our theoretical development lies in an approximation result for the conditional score function, which relies on a novel diffused Taylor approximation technique. Moreover, we demonstrate the utility of our statistical theory in elucidating the performance of conditional diffusion models across diverse applications, including model-based transition kernel estimation in reinforcement learning, solving inverse problems, and reward conditioned sample generation.

Title: Diffusion Denoising as a Certified Defense against Clean-label Poisoning

Authors: Sanghyun Hong, Nicholas Carlini, Alexey Kurakin
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11981
Pdf URL: https://arxiv.org/pdf/2403.11981
Copy Paste: [[2403.11981]] Diffusion Denoising as a Certified Defense against Clean-label Poisoning(https://arxiv.org/abs/2403.11981)
Keywords: diffusion
Abstract: We present a certified defense to clean-label poisoning attacks. These attacks work by injecting a small number of poisoning samples (e.g., 1%) that contain $p$-norm bounded adversarial perturbations into the training data to induce a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by denoised smoothing, we show how an off-the-shelf diffusion model can sanitize the tampered training data. We extensively test our defense against seven clean-label poisoning attacks and reduce their attack success to 0-16% with only a negligible drop in the test-time accuracy. We compare our defense with existing countermeasures against clean-label poisoning, showing that the defense reduces the attack success the most and offers the best model utility. Our results highlight the need for future work on developing stronger clean-label attacks and using our certified yet practical defense as a strong baseline to evaluate these attacks.

Title: Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching

Authors: Andrew Katz, Mitchell Gerhardt, Michelle Soledad
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2403.11984
Pdf URL: https://arxiv.org/pdf/2403.11984
Copy Paste: [[2403.11984]] Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching(https://arxiv.org/abs/2403.11984)
Keywords: generative
Abstract: Feedback is a critical aspect of improvement. Unfortunately, when there is a lot of feedback from multiple sources, it can be difficult to distill the information into actionable insights. Consider student evaluations of teaching (SETs), which are important sources of feedback for educators. They can give instructors insights into what worked during a semester. A collection of SETs can also be useful to administrators as signals for courses or entire programs. However, on a large scale as in high-enrollment courses or administrative records over several years, the volume of SETs can render them difficult to analyze. In this paper, we discuss a novel method for analyzing SETs using natural language processing (NLP) and large language models (LLMs). We demonstrate the method by applying it to a corpus of 5,000 SETs from a large public university. We show that the method can be used to extract, embed, cluster, and summarize the SETs to identify the themes they express. More generally, this work illustrates how to use the combination of NLP techniques and LLMs to generate a codebook for SETs. We conclude by discussing the implications of this method for analyzing SETs and other types of student writing in teaching and research settings.

Title: GetMesh: A Controllable Model for High-quality Mesh Generation and Manipulation

Authors: Zhaoyang Lyu, Ben Fei, Jinyi Wang, Xudong Xu, Ya Zhang, Weidong Yang, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11990
Pdf URL: https://arxiv.org/pdf/2403.11990
Copy Paste: [[2403.11990]] GetMesh: A Controllable Model for High-quality Mesh Generation and Manipulation(https://arxiv.org/abs/2403.11990)
Keywords: generative
Abstract: Mesh is a fundamental representation of 3D assets in various industrial applications, and is widely supported by professional software. However, due to its irregular structure, mesh creation and manipulation are often time-consuming and labor-intensive. In this paper, we propose a highly controllable generative model, GetMesh, for mesh generation and manipulation across different categories. By taking a varying number of points as the latent representation, and re-organizing them as a triplane representation, GetMesh generates meshes with rich and sharp details, outperforming both single-category and multi-category counterparts. Moreover, it also enables fine-grained control over the generation process that previous mesh generative models cannot achieve, where changing global/local mesh topologies, adding/removing mesh parts, and combining mesh parts across categories can be intuitively, efficiently, and robustly accomplished by adjusting the number, positions or features of latent points. Project page is https://getmesh.github.io.

Title: Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning

Authors: Markus J. Buehler
Subjects: cs.LG, cond-mat.mes-hall, cond-mat.mtrl-sci, cond-mat.soft, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11996
Pdf URL: https://arxiv.org/pdf/2403.11996
Copy Paste: [[2403.11996]] Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning(https://arxiv.org/abs/2403.11996)
Keywords: generative
Abstract: Using generative Artificial Intelligence (AI), we transformed a set of 1,000 scientific papers in the area of biological materials into detailed ontological knowledge graphs, revealing their inherently scale-free nature. Using graph traversal path detection between dissimilar concepts based on combinatorial ranking of node similarity and betweenness centrality, we reveal deep insights into unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, and propose never-before-seen material designs and their behaviors. One comparison revealed detailed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. The algorithm further created an innovative hierarchical mycelium-based composite that incorporates joint synthesis of graph sampling with principles extracted from Kandinsky's Composition VII painting, where the resulting composite reflects a balance of chaos and order, with features like adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across physical, biological, and artistic spheres, revealing a nuanced ontology of immanence and material flux that resonates with postmodern philosophy, and positions these interconnections within a heterarchical framework. Our findings reveal the dynamic, context-dependent interplay of entities beyond traditional hierarchical paradigms, emphasizing the significant role of individual components and their fluctuative relationships within the system. Our predictions achieve a far higher degree of novelty, technical detail and explorative capacity than conventional generative AI methods. The approach establishes a widely useful framework for innovation by revealing hidden connections that facilitate discovery.

Title: Learning Useful Representations of Recurrent Neural Network Weight Matrices

Authors: Vincent Herrmann, Francesco Faccio, Jürgen Schmidhuber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.11998
Pdf URL: https://arxiv.org/pdf/2403.11998
Copy Paste: [[2403.11998]] Learning Useful Representations of Recurrent Neural Network Weight Matrices(https://arxiv.org/abs/2403.11998)
Keywords: self-supervised, generative
Abstract: Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How can we learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality -- specifically, its input-output mapping. We consider several mechanistic approaches for RNN weights and adapt the permutation equivariant Deep Weight Space layer for RNNs. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. We develop a theoretical framework that demonstrates conditions under which the functionalist approach can generate rich representations that help determine RNN behavior. We create and release the first two 'model zoo' datasets for RNN weight representation learning. One consists of generative models of a class of formal languages, and the other one of classifiers of sequentially processed MNIST digits. With the help of an emulation-based self-supervised learning technique we compare and evaluate the different RNN weight encoding techniques on multiple downstream applications. On the most challenging one, namely predicting which exact task the RNN was trained on, functionalist approaches show clear superiority.

Title: DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing

Authors: Hyeonho Jeong, Jinho Chang, Geon Yeong Park, Jong Chul Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12002
Pdf URL: https://arxiv.org/pdf/2403.12002
Copy Paste: [[2403.12002]] DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing(https://arxiv.org/abs/2403.12002)
Keywords: diffusion
Abstract: Text-driven diffusion-based video editing presents a unique challenge not encountered in image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match space-time self-similarities of the original video and the edited video during the score distillation. Thanks to the use of score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.

Title: GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

Authors: Xiaojie Li, Yibo Yang, Xiangtai Li, Jianlong Wu, Yue Yu, Bernard Ghanem, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12003
Pdf URL: https://arxiv.org/pdf/2403.12003
Copy Paste: [[2403.12003]] GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning(https://arxiv.org/abs/2403.12003)
Keywords: self-supervised, generative
Abstract: Self-supervised learning has achieved remarkable success in acquiring high-quality representations from unlabeled data. The widely adopted contrastive learning framework aims to learn invariant representations by minimizing the distance between positive views originating from the same image. However, existing techniques to construct positive views rely heavily on manual transformations, resulting in limited diversity and potentially false positive pairs. To tackle these challenges, we present GenView, a controllable framework that augments the diversity of positive views leveraging the power of pretrained generative models while preserving semantics. We develop an adaptive view generation method that dynamically adjusts the noise level in sampling to ensure the preservation of essential semantic meaning while introducing variability. Additionally, we introduce a quality-driven contrastive loss, which assesses the quality of positive pairs by considering both foreground similarity and background diversity. This loss prioritizes the high-quality positive pairs we construct while reducing the influence of low-quality pairs, thereby mitigating potential semantic inconsistencies introduced by generative models and aggressive data augmentation. Thanks to the improved positive view quality and the quality-driven contrastive loss, GenView significantly improves self-supervised learning across various tasks. For instance, GenView improves MoCov2 performance by 2.5%/2.2% on ImageNet linear/semi-supervised classification. Moreover, GenView even performs much better than naively augmenting the ImageNet dataset with Laion400M or ImageNet21K. Code is available at https://github.com/xiaojieli0903/genview.

Title: SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Authors: Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12008
Pdf URL: https://arxiv.org/pdf/2403.12008
Copy Paste: [[2403.12008]] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion(https://arxiv.org/abs/2403.12008)
Keywords: diffusion, generative
Abstract: We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation proposes techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of the video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques to use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics, as well as a user study, demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.

Title: Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural Networks

Authors: K. P. Santoso, R. V. H. Ginardi, R. A. Sastrowardoyo, F. A. Madany
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12009
Pdf URL: https://arxiv.org/pdf/2403.12009
Copy Paste: [[2403.12009]] Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural Networks(https://arxiv.org/abs/2403.12009)
Keywords: generative
Abstract: In the realm of skin lesion image classification, the intricate spatial and semantic features pose significant challenges for conventional Convolutional Neural Network (CNN)-based methodologies. These challenges are compounded by the imbalanced nature of skin lesion datasets, which hampers the ability of models to learn minority class features effectively. Despite augmentation strategies, such as those using Generative Adversarial Networks (GANs), previous attempts have not fully addressed these complexities. This study introduces an innovative approach by integrating Graph Neural Networks (GNNs) with Capsule Networks to enhance classification performance. GNNs, known for their proficiency in handling graph-structured data, offer an advanced mechanism for capturing complex patterns and relationships beyond the capabilities of traditional CNNs. Capsule Networks further contribute by providing superior recognition of spatial hierarchies within images. Our research focuses on evaluating and enhancing the Tiny Pyramid Vision GNN (Tiny Pyramid ViG) architecture by incorporating it with a Capsule Network. This hybrid model was applied to the MNIST:HAM10000 dataset, a comprehensive skin lesion dataset designed for benchmarking classification models. After 75 epochs of training, our model achieved a significant accuracy improvement, reaching 89.23% and 95.52%, surpassing established benchmarks such as GoogLeNet (83.94%), InceptionV3 (86.82%), MobileNet V3 (89.87%), EfficientNet-7B (92.07%), ResNet18 (92.22%), ResNet34 (91.90%), ViT-Base (73.70%), and IRv2-SA (93.47%) on the same dataset. This outcome underscores the potential of our approach in overcoming the inherent challenges of skin lesion classification, contributing to the advancement of image-based diagnosis in dermatology.

Title: VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

Authors: Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2403.12010
Pdf URL: https://arxiv.org/pdf/2403.12010
Copy Paste: [[2403.12010]] VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model(https://arxiv.org/abs/2403.12010)
Keywords: diffusion, generative
Abstract: Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training, we propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video data sets used to train these models are abundant and diverse, leading to a reduced train-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to get an explicit global 3D model, and then adopts a sampling strategy that effectively involves images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is aigc3d.github.io/VideoMV.

Title: HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Authors: Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12011
Pdf URL: https://arxiv.org/pdf/2403.12011
Copy Paste: [[2403.12011]] HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data(https://arxiv.org/abs/2403.12011)
Keywords: diffusion
Abstract: 3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper, we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more controllable and realistic synthesis as we can specify the structure and style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations. Beyond controllable image synthesis, we adopt the generated 3D data for learning 6D object pose estimation and show its effectiveness in improving perception systems. Project page: https://mq-zhang1.github.io/HOIDiffusion

Title: GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Authors: Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, Xiaoxiao Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12013
Pdf URL: https://arxiv.org/pdf/2403.12013
Copy Paste: [[2403.12013]] GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image(https://arxiv.org/abs/2403.12013)
Keywords: diffusion, foundation model, generative
Abstract: We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, the progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, prior works are either constrained to limited scenarios or unable to capture geometric details. In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address the inherently ill-posed problem. We further show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage. Specifically, we extend the original stable diffusion model to jointly predict depth and normal, allowing mutual information exchange and high consistency between the two representations. More importantly, we propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions. This strategy enables our model to recognize different scene layouts, capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot depth and normal prediction, significantly enhancing many downstream applications such as 3D reconstruction, 2D content creation, and novel viewpoint synthesis.

Title: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Authors: Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12015
Pdf URL: https://arxiv.org/pdf/2403.12015
Copy Paste: [[2403.12015]] Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation(https://arxiv.org/abs/2403.12015)
Keywords: diffusion, generative
Abstract: Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.

Title: LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Authors: Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12019
Pdf URL: https://arxiv.org/pdf/2403.12019
Copy Paste: [[2403.12019]] LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation(https://arxiv.org/abs/2403.12019)
Keywords: diffusion, generative
Abstract: The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.

Title: From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

Authors: Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2403.12027
Pdf URL: https://arxiv.org/pdf/2403.12027
Copy Paste: [[2403.12027]] From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models(https://arxiv.org/abs/2403.12027)
Keywords: foundation model
Abstract: Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models (LLMs), have revolutionized various natural language processing (NLP) tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. The paper begins by defining chart understanding, outlining problem formulations, and discussing fundamental building blocks crucial for studying chart understanding tasks. In the section on tasks and datasets, we explore various tasks within chart understanding and discuss their evaluation metrics and sources of both charts and textual inputs. Modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed in a dedicated section, highlighting issues such as domain-specific charts, lack of efforts in evaluation, and agent-oriented settings. This survey paper serves to provide valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.

Title: Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Authors: Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.12032
Pdf URL: https://arxiv.org/pdf/2403.12032
Copy Paste: [[2403.12032]] Generic 3D Diffusion Adapter Using Controlled Multi-View Editing(https://arxiv.org/abs/2403.12032)
Keywords: diffusion
Abstract: Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.

Title: VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

Authors: Junlin Han, Filippos Kokkinos, Philip Torr
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12034
Pdf URL: https://arxiv.org/pdf/2403.12034
Copy Paste: [[2403.12034]] VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models(https://arxiv.org/abs/2403.12034)
Keywords: diffusion, generative
Abstract: This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time.

Title: One-Step Image Translation with Text-to-Image Models

Authors: Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12036
Pdf URL: https://arxiv.org/pdf/2403.12036
Copy Paste: [[2403.12036]] One-Step Image Translation with Text-to-Image Models(https://arxiv.org/abs/2403.12036)
Keywords: diffusion
Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.

Title: MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

Authors: Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12037
Pdf URL: https://arxiv.org/pdf/2403.12037
Copy Paste: [[2403.12037]] MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control(https://arxiv.org/abs/2403.12037)
Keywords: diffusion
Abstract: It is a long-lasting goal to design a generalist-embodied agent that can follow diverse instructions in human-like ways. However, existing approaches often fail to steadily follow instructions due to difficulties in understanding abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and diffusion models, and we employ a Chain-of-Imagination (CoI) mechanism to envision the step-by-step process of executing instructions and translating imaginations into more precise visual prompts tailored to the current state; subsequently, the agent generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instructions at each step. Extensive experiments demonstrate that MineDreamer follows single and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization and comprehension of the open world.

Title: Zero-Shot Image Feature Consensus with Deep Functional Maps

Authors: Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12038
Pdf URL: https://arxiv.org/pdf/2403.12038
Copy Paste: [[2403.12038]] Zero-Shot Image Feature Consensus with Deep Functional Maps(https://arxiv.org/abs/2403.12038)
Keywords: generative
Abstract: Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of different layers or networks. We point out that a better correspondence strategy is available, which directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but also more accurate, with the possibility of better reflecting the knowledge embedded in the large-scale vision models that we are studying. Our approach sets a new state-of-the-art on various dense correspondence tasks. We also demonstrate our effectiveness in keypoint correspondence and affordance map transfer.

Title: Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Authors: Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12042
Pdf URL: https://arxiv.org/pdf/2403.12042
Copy Paste: [[2403.12042]] Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation(https://arxiv.org/abs/2403.12042)
Keywords: diffusion, generative
Abstract: In this paper, we explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed "VD-IT", with specially designed components built upon a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using the standard Gaussian noise, we propose to predict the video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevates segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pre-tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at https://github.com/buxiangzhiren/VD-IT.