2024-03-21

Title: Fundamental Components of Deep Learning: A category-theoretic approach

Authors: Bruno Gavranović
Subjects: cs.LG, cs.AI, math.CT
Abstract URL: https://arxiv.org/abs/2403.13001
Pdf URL: https://arxiv.org/pdf/2403.13001
Copy Paste: [[2403.13001]] Fundamental Components of Deep Learning: A category-theoretic approach(https://arxiv.org/abs/2403.13001)
Keywords: in-context
Abstract: Deep learning, despite its remarkable achievements, is still a young field. Like the early stages of many scientific disciplines, it is marked by the discovery of new phenomena, ad-hoc design decisions, and the lack of a uniform and compositional mathematical foundation. From the intricacies of the implementation of backpropagation, through a growing zoo of neural network architectures, to the new and poorly understood phenomena such as double descent, scaling laws or in-context learning, there are few unifying principles in deep learning. This thesis develops a novel mathematical foundation for deep learning based on the language of category theory. We develop a new framework that is a) end-to-end, b) uniform, and c) not merely descriptive, but prescriptive, meaning it is amenable to direct implementation in programming languages with sufficient features. We also systematise many existing approaches, placing many existing constructions and concepts from the literature under the same umbrella. In Part I we identify and model two main properties of deep learning systems: parametricity and bidirectionality. We expand on the previously defined construction of actegories and Para to study the former, and define weighted optics to study the latter. Combining them yields parametric weighted optics, a categorical model of artificial neural networks, and more. Part II justifies the abstractions from Part I, applying them to model backpropagation, architectures, and supervised learning. We provide a lens-theoretic axiomatisation of differentiation, covering not just smooth spaces, but discrete settings of boolean circuits as well. We survey existing, and develop new, categorical models of neural network architectures. We formalise the notion of optimisers and lastly, combine all the existing concepts together, providing a uniform and compositional framework for supervised learning.
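
To make the "lens-theoretic" view of differentiation concrete, here is a minimal sketch of a lens as a forward map paired with a backward map, where composing two lenses reproduces the chain rule. The `Lens` class, its `then` method, and the example lenses `square` and `triple` are illustrative names, not constructions taken from the thesis, and the sketch covers only scalar functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Lens:
    """A scalar lens: a forward pass plus a backward pass (hypothetical helper)."""
    forward: Callable[[float], float]          # A -> B
    backward: Callable[[float, float], float]  # (A, dB) -> dA

    def then(self, other: "Lens") -> "Lens":
        """Sequential composition: run self first, then other."""
        def forward(a: float) -> float:
            return other.forward(self.forward(a))

        def backward(a: float, dc: float) -> float:
            b = self.forward(a)          # recompute the intermediate value
            db = other.backward(b, dc)   # pull the gradient back through `other`
            return self.backward(a, db)  # then back through `self`

        return Lens(forward, backward)


# x -> x^2, with derivative 2x
square = Lens(lambda x: x * x, lambda x, dy: 2.0 * x * dy)
# y -> 3y, with derivative 3
triple = Lens(lambda y: 3.0 * y, lambda y, dz: 3.0 * dz)

composite = square.then(triple)  # x -> 3x^2, derivative 6x
```

Evaluating `composite.backward(x, 1.0)` gives the derivative of the composite at `x`, so the backward pass of the composed lens is exactly reverse-mode differentiation for this toy case.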

Title: Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition

Authors: Bach Nguyen-Xuan, Thien Nguyen-Hoang, Nhu Tai-Do
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13039
Pdf URL: https://arxiv.org/pdf/2403.13039
Copy Paste: [[2403.13039]] Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition(https://arxiv.org/abs/2403.13039)
Keywords: self-supervised
Abstract: Facial Expression Recognition (FER) is a critical task within computer vision with diverse applications across various domains. Addressing the challenge of limited FER datasets, which hampers the generalization capability of expression recognition models, is imperative for enhancing performance. Our paper presents an innovative approach integrating the MAE-Face self-supervised learning (SSL) method and Fusion Attention mechanism for expression classification, particularly showcased in the 6th Affective Behavior Analysis in-the-wild (ABAW) competition. Additionally, we propose preprocessing techniques to emphasize essential facial features, thereby enhancing model performance on both training and validation sets, notably demonstrated on the Aff-wild2 dataset.

Title: Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Authors: Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13044
Pdf URL: https://arxiv.org/pdf/2403.13044
Copy Paste: [[2403.13044]] Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos(https://arxiv.org/abs/2403.13044)
Keywords: diffusion, generative
Abstract: We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts. Yet, it adapts it to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects.

Title: Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning

Authors: Mengxian Lyu, Cheng Peng, Xiaohan Li, Patrick Balian, Jiang Bian, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.13089
Pdf URL: https://arxiv.org/pdf/2403.13089
Copy Paste: [[2403.13089]] Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning(https://arxiv.org/abs/2403.13089)
Keywords: generative
Abstract: Automatic text summarization (ATS) is an emerging technology to assist clinicians in providing continuous and coordinated care. This study presents an approach to summarize doctor-patient dialogues using generative large language models (LLMs). We developed prompt-tuning algorithms to instruct generative LLMs to summarize clinical text. We examined the prompt-tuning strategies, the size of soft prompts, and the few-shot learning ability of GatorTronGPT, a generative clinical LLM developed using 277 billion clinical and general English words with up to 20 billion parameters. We compared GatorTronGPT with a previous solution based on fine-tuning of a widely used T5 model, using a clinical benchmark dataset MTS-DIALOG. The experimental results show that the GatorTronGPT-20B model achieved the best performance on all evaluation metrics. The proposed solution has a low computing cost as the LLM parameters are not updated during prompt-tuning. This study demonstrates the efficiency of generative clinical LLMs for clinical ATS through prompt tuning.

Title: Better Call SAL: Towards Learning to Segment Anything in Lidar

Authors: Aljoša Ošep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, Laura Leal-Taixé
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.13129
Pdf URL: https://arxiv.org/pdf/2403.13129
Copy Paste: [[2403.13129]] Better Call SAL: Towards Learning to Segment Anything in Lidar(https://arxiv.org/abs/2403.13129)
Keywords: foundation model
Abstract: We propose the SAL (Segment Anything in Lidar) method, consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we utilize 2D vision foundation models to generate 3D supervision "for free". Our pseudo-labels consist of instance masks and corresponding CLIP tokens, which we lift to Lidar using calibrated multi-modal data. By training our model on these labels, we distill the 2D foundation models into our Lidar SAL model. Even without manual labels, our model achieves 91% in terms of class-agnostic segmentation and 44% in terms of zero-shot LPS of the fully supervised state-of-the-art. Furthermore, we outperform several baselines that do not distill but only lift image features to 3D. More importantly, we demonstrate that SAL supports arbitrary class prompts, can be easily extended to new datasets, and shows significant potential to improve with increasing amounts of self-labeled data.

Title: Self-generated Replay Memories for Continual Neural Machine Translation

Authors: Michele Resta, Davide Bacciu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13130
Pdf URL: https://arxiv.org/pdf/2403.13130
Copy Paste: [[2403.13130]] Self-generated Replay Memories for Continual Neural Machine Translation(https://arxiv.org/abs/2403.13130)
Keywords: generative
Abstract: Modern Neural Machine Translation systems exhibit strong performance in several different languages and are constantly improving. Their ability to learn continuously is, however, still severely limited by the catastrophic forgetting issue. In this work, we leverage a key property of encoder-decoder Transformers, i.e. their generative ability, to propose a novel approach to continually learning Neural Machine Translation systems. We show how this can effectively learn on a stream of experiences comprising different languages, by leveraging a replay memory populated by using the model itself as a generator of parallel sentences. We empirically demonstrate that our approach can counteract catastrophic forgetting without requiring explicit memorization of training data. Code will be publicly available upon publication. Code: https://github.com/m-resta/sg-rep

Title: VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning

Authors: Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.13164
Pdf URL: https://arxiv.org/pdf/2403.13164
Copy Paste: [[2403.13164]] VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning(https://arxiv.org/abs/2403.13164)
Keywords: in-context
Abstract: Large language models (LLMs) famously exhibit emergent in-context learning (ICL) -- the ability to rapidly adapt to new tasks using few-shot examples provided as a prompt, without updating the model's weights. Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding. However, investigations into multimodal ICL have predominantly focused on few-shot visual question answering (VQA) and image captioning, which we will show neither exploit the strengths of ICL, nor test its limitations. The broader capabilities and limitations of multimodal ICL remain under-explored. In this study, we introduce a comprehensive benchmark VL-ICL Bench for multimodal in-context learning, encompassing a broad spectrum of tasks that involve both images and text as inputs and outputs, and different types of challenges, from perception to reasoning and long context length. We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite, revealing their diverse strengths and weaknesses, and showing that even the most advanced models, such as GPT-4, find the tasks challenging. By highlighting a range of new ICL tasks, and the associated strengths and limitations of existing models, we hope that our dataset will inspire future work on enhancing the in-context learning capabilities of VLLMs, as well as inspire new applications that leverage VLLM ICL. The code and dataset are available at https://github.com/ys-zong/VL-ICL.

Title: Predictive, scalable and interpretable knowledge tracing on structured domains

Authors: Hanqi Zhou, Robert Bamler, Charley M. Wu, Álvaro Tejero-Cantero
Subjects: cs.LG, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2403.13179
Pdf URL: https://arxiv.org/pdf/2403.13179
Copy Paste: [[2403.13179]] Predictive, scalable and interpretable knowledge tracing on structured domains(https://arxiv.org/abs/2403.13179)
Keywords: generative
Abstract: Intelligent tutoring systems optimize the selection and timing of learning materials to enhance understanding and long-term retention. This requires estimates of both the learner's progress ("knowledge tracing"; KT), and the prerequisite structure of the learning domain ("knowledge mapping"). While recent deep learning models achieve high KT accuracy, they do so at the expense of the interpretability of psychologically-inspired models. In this work, we present a solution to this trade-off. PSI-KT is a hierarchical generative approach that explicitly models how both individual cognitive traits and the prerequisite structure of knowledge influence learning dynamics, thus achieving interpretability by design. Moreover, by using scalable Bayesian inference, PSI-KT targets the real-world need for efficient personalization even with a growing body of learners and learning histories. Evaluated on three datasets from online learning platforms, PSI-KT achieves superior multi-step predictive accuracy and scalable inference in continual-learning settings, all while providing interpretable representations of learner-specific traits and the prerequisite structure of knowledge that causally supports learning. In sum, predictive, scalable and interpretable knowledge tracing with solid knowledge mapping lays a key foundation for effective personalized learning to make education accessible to a broad, global audience.

Title: Depth-guided NeRF Training via Earth Mover's Distance

Authors: Anita Rau, Josiah Aklilu, F. Christopher Holsinger, Serena Yeung-Levy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.13206
Pdf URL: https://arxiv.org/pdf/2403.13206
Copy Paste: [[2403.13206]] Depth-guided NeRF Training via Earth Mover's Distance(https://arxiv.org/abs/2403.13206)
Keywords: diffusion
Abstract: Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice, their accuracy is more challenging to capture. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because we know that depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover's Distance instead of enforcing the rendered depth to replicate the depth prior exactly through L2-loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.
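
As a rough illustration of why Earth Mover's Distance suits distributions along a ray, the sketch below computes the 1D EMD between two histograms over the same depth bins as the accumulated gap between their cumulative sums. The function name `emd_1d` and the `bin_width` parameter are assumptions made for this example; the abstract does not give the exact loss formulation, and the sketch assumes both histograms carry equal total mass.

```python
def emd_1d(p, q, bin_width=1.0):
    """1D Earth Mover's Distance between two histograms on identical bins.

    Assumes p and q have the same total mass (e.g. both sum to 1).
    In 1D the optimal transport cost equals the integral of |CDF_p - CDF_q|.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    total = 0.0
    cdf_gap = 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i               # running difference of the two CDFs
        total += abs(cdf_gap) * bin_width  # mass that must cross this bin edge
    return total


# All termination mass in the first bin vs. all of it in the third bin:
far_apart = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# The same histogram shifted by one bin:
shifted = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```

Unlike a pointwise L2 penalty on a single rendered depth value, this cost grows with how far the probability mass has to move, so a near miss is penalized less than a distant one.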

Title: Diffusion Model for Data-Driven Black-Box Optimization

Authors: Zihao Li, Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Yinyu Ye, Minshuo Chen, Mengdi Wang
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2403.13219
Pdf URL: https://arxiv.org/pdf/2403.13219
Copy Paste: [[2403.13219]] Diffusion Model for Data-Driven Black-Box Optimization(https://arxiv.org/abs/2403.13219)
Keywords: diffusion, generative
Abstract: Generative AI has redefined artificial intelligence, enabling the creation of innovative content and customized solutions that drive business practices into a new era of efficiency and creativity. In this paper, we focus on diffusion models, a powerful generative AI technology, and investigate their potential for black-box optimization over complex structured variables. Consider the practical scenario where one wants to optimize some structured design in a high-dimensional space, based on massive unlabeled data (representing design variables) and a small labeled dataset. We study two practical types of labels: 1) noisy measurements of a real-valued reward function and 2) human preference based on pairwise comparisons. The goal is to generate new designs that are near-optimal and preserve the designed latent structures. Our proposed method reformulates the design optimization problem into a conditional sampling problem, which allows us to leverage the power of diffusion models for modeling complex distributions. In particular, we propose a reward-directed conditional diffusion model, to be trained on the mixed data, for sampling a near-optimal solution conditioned on high predicted rewards. Theoretically, we establish sub-optimality error bounds for the generated designs. The sub-optimality gap nearly matches the optimal guarantee in off-policy bandits, demonstrating the efficiency of reward-directed diffusion models for black-box optimization. Moreover, when the data admits a low-dimensional latent subspace structure, our model efficiently generates high-fidelity designs that closely respect the latent structure. We provide empirical experiments validating our model in decision-making and content-creation tasks.

Title: Beyond Skeletons: Integrative Latent Mapping for Coherent 4D Sequence Generation

Authors: Qitong Yang, Mingtao Feng, Zijie Wu, Shijie Sun, Weisheng Dong, Yaonan Wang, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13238
Pdf URL: https://arxiv.org/pdf/2403.13238
Copy Paste: [[2403.13238]] Beyond Skeletons: Integrative Latent Mapping for Coherent 4D Sequence Generation(https://arxiv.org/abs/2403.13238)
Keywords: diffusion
Abstract: Directly learning to model 4D content, including shape, color and motion, is challenging. Existing methods depend on skeleton-based motion control and offer limited continuity in detail. To address this, we propose a novel framework that generates coherent 4D sequences with animation of 3D shapes under given conditions with dynamic evolution of shape and color over time through integrative latent mapping. We first employ an integrative latent unified representation to encode shape and color information of each detailed 3D geometry frame. The proposed skeleton-free latent 4D sequence joint representation allows us to leverage diffusion models in a low-dimensional space to control the generation of 4D sequences. Finally, temporally coherent 4D sequences are generated conforming well to the input images and text prompts. Extensive experiments on the ShapeNet, 3DBiCar and DeformingThings4D datasets for several tasks demonstrate that our method effectively learns to generate quality 3D shapes with color and 4D mesh animations, improving over the current state-of-the-art. Source code will be released.

Title: SAMCT: Segment Any CT Allowing Labor-Free Task-Indicator Prompts

Authors: Xian Lin, Yangyang Xiang, Zhehao Wang, Kwang-Ting Cheng, Zengqiang Yan, Li Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13258
Pdf URL: https://arxiv.org/pdf/2403.13258
Copy Paste: [[2403.13258]] SAMCT: Segment Any CT Allowing Labor-Free Task-Indicator Prompts(https://arxiv.org/abs/2403.13258)
Keywords: foundation model
Abstract: Segment anything model (SAM), a foundation model with superior versatility and generalization across diverse segmentation tasks, has attracted widespread attention in medical imaging. However, it has been proved that SAM would encounter severe performance degradation due to the lack of medical knowledge in training and local feature encoding. Though several SAM-based models have been proposed for tuning SAM in medical imaging, they still suffer from insufficient feature extraction and highly rely on high-quality prompts. In this paper, we construct a large CT dataset consisting of 1.1M CT images and 5M masks from public datasets and propose a powerful foundation model SAMCT allowing labor-free prompts. Specifically, based on SAM, SAMCT is further equipped with a U-shaped CNN image encoder, a cross-branch interaction module, and a task-indicator prompt encoder. The U-shaped CNN image encoder works in parallel with the ViT image encoder in SAM to supplement local features. Cross-branch interaction enhances the feature expression capability of the CNN image encoder and the ViT image encoder by exchanging global perception and local features from one to the other. The task-indicator prompt encoder is a plug-and-play component to effortlessly encode task-related indicators into prompt embeddings. In this way, SAMCT can work in an automatic manner in addition to the semi-automatic interactive strategy in SAM. Extensive experiments demonstrate the superiority of SAMCT against the state-of-the-art task-specific and SAM-based medical foundation models on various tasks. The code, data, and models are released at https://github.com/xianlin7/SAMCT.

Title: Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations

Authors: Kewei Wang, Yizheng Wu, Jun Cen, Zhiyu Pan, Xingyi Li, Zhe Wang, Zhiguo Cao, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13261
Pdf URL: https://arxiv.org/pdf/2403.13261
Copy Paste: [[2403.13261]] Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations(https://arxiv.org/abs/2403.13261)
Keywords: self-supervised
Abstract: The perception of motion behavior in a dynamic environment holds significant importance for autonomous driving systems, wherein class-agnostic motion prediction methods directly predict the motion of the entire point cloud. While most existing methods rely on fully-supervised learning, the manual labeling of point cloud data is laborious and time-consuming. Therefore, several annotation-efficient methods have been proposed to address this challenge. Although effective, these methods rely on weak annotations or additional multi-modal data like images, and the potential benefits inherent in the point cloud sequence are still underexplored. To this end, we explore the feasibility of self-supervised motion prediction with only unlabeled LiDAR point clouds. Initially, we employ an optimal transport solver to establish coarse correspondences between current and future point clouds as the coarse pseudo motion labels. Training models directly using such coarse labels leads to noticeable spatial and temporal prediction inconsistencies. To mitigate these issues, we introduce three simple spatial and temporal regularization losses, which facilitate the self-supervised training process effectively. Experimental results demonstrate the significant superiority of our approach over the state-of-the-art self-supervised methods.

Title: Text-to-3D Shape Generation

Authors: Han-Hung Lee, Manolis Savva, Angel X. Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13289
Pdf URL: https://arxiv.org/pdf/2403.13289
Copy Paste: [[2403.13289]] Text-to-3D Shape Generation(https://arxiv.org/abs/2403.13289)
Keywords: generative
Abstract: Recent years have seen an explosion of work and interest in text-to-3D shape generation. Much of the progress is driven by advances in 3D representations, large-scale pretraining and representation learning for text and image data enabling generative AI models, and differentiable rendering. Computational systems that can perform text-to-3D shape generation have captivated the popular imagination as they enable non-expert users to easily create 3D content directly from text. However, there are still many limitations and challenges remaining in this problem space. In this state-of-the-art report, we provide a survey of the underlying technology and methods enabling text-to-3D shape generation to summarize the background literature. We then derive a systematic categorization of recent work on text-to-3D shape generation based on the type of supervision data required. Finally, we discuss limitations of the existing categories of methods, and delineate promising directions for future work.

Title: Building Optimal Neural Architectures using Interpretable Knowledge

Authors: Keith G. Mills, Fred X. Han, Mohammad Salameh, Shengyao Lu, Chunhua Zhou, Jiao He, Fengyu Sun, Di Niu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13293
Pdf URL: https://arxiv.org/pdf/2403.13293
Copy Paste: [[2403.13293]] Building Optimal Neural Architectures using Interpretable Knowledge(https://arxiv.org/abs/2403.13293)
Keywords: diffusion
Abstract: Neural Architecture Search is a costly practice. The fact that a search space can span a vast number of design choices with each architecture evaluation taking nontrivial overhead makes it hard for an algorithm to sufficiently explore candidate networks. In this paper, we propose AutoBuild, a scheme which learns to align the latent embeddings of operations and architecture modules with the ground-truth performance of the architectures they appear in. By doing so, AutoBuild is capable of assigning interpretable importance scores to architecture modules, such as individual operation features and larger macro operation sequences such that high-performance neural networks can be constructed without any need for search. Through experiments performed on state-of-the-art image classification, segmentation, and Stable Diffusion models, we show that by mining a relatively small set of evaluated architectures, AutoBuild can learn to build high-quality architectures directly or help to reduce search space to focus on relevant areas, finding better architectures that outperform both the original labeled ones and ones found by search baselines. Code available at https://github.com/Ascend-Research/AutoBuild

Title: DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

Authors: Yibo Wang, Ruiyuan Gao, Kai Chen, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13304
Pdf URL: https://arxiv.org/pdf/2403.13304
Copy Paste: [[2403.13304]] DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception(https://arxiv.org/abs/2403.13304)
Keywords: diffusion, generative
Abstract: Current perceptive models heavily depend on resource-intensive datasets, prompting the need for innovative solutions. Leveraging recent advances in diffusion models, synthetic data, by constructing image inputs from various annotations, proves beneficial for downstream tasks. While prior methods have separately addressed generative and perceptive models, DetDiffusion, for the first time, harmonizes both, tackling the challenges in generating effective data for perceptive models. To enhance image generation with perceptive models, we introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability. To boost the performance of specific perceptive models, our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation. Experimental results from the object detection task highlight DetDiffusion's superior performance, establishing a new state-of-the-art in layout-guided generation. Furthermore, image syntheses from DetDiffusion can effectively augment training data, significantly enhancing downstream detection performance.

Title: LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment

Authors: Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, Yuexin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13307
Pdf URL: https://arxiv.org/pdf/2403.13307
Copy Paste: [[2403.13307]] LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment(https://arxiv.org/abs/2403.13307)
Keywords: diffusion
Abstract: Language-guided scene-aware human motion generation has great significance for entertainment and robotics. In response to the limitations of existing datasets, we introduce LaserHuman, a pioneering dataset engineered to revolutionize Scene-Text-to-Motion research. LaserHuman stands out with its inclusion of genuine human motions within 3D environments, unbounded free-form natural language descriptions, a blend of indoor and outdoor scenarios, and dynamic, ever-changing scenes. Diverse modalities of capture data and rich annotations present great opportunities for the research of conditional motion generation, and can also facilitate the development of real-life applications. Moreover, to generate semantically consistent and physically plausible human motions, we propose a multi-conditional diffusion model, which is simple but effective, achieving state-of-the-art performance on existing datasets.

Title: TiBiX: Leveraging Temporal Information for Bidirectional X-ray and Report Generation

Authors: Santosh Sanjeev, Fadillah Adamsyah Maani, Arsen Abzhanov, Vijay Ram Papineni, Ibrahim Almakky, Bartłomiej W. Papież, Mohammad Yaqub
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13343
Pdf URL: https://arxiv.org/pdf/2403.13343
Copy Paste: [[2403.13343]] TiBiX: Leveraging Temporal Information for Bidirectional X-ray and Report Generation(https://arxiv.org/abs/2403.13343)
Keywords: generative
Abstract: With the emergence of vision language models in the medical imaging domain, numerous studies have focused on two dominant research activities: (1) report generation from Chest X-rays (CXR), and (2) synthetic scan generation from text or reports. Despite some research incorporating multi-view CXRs into the generative process, prior patient scans and reports have been generally disregarded. This can inadvertently lead to the leaving out of important medical information, thus affecting generation quality. To address this, we propose TiBiX: Leveraging Temporal information for Bidirectional X-ray and Report Generation. Considering previous scans, our approach facilitates bidirectional generation, primarily addressing two challenging problems: (1) generating the current image from the previous image and current report and (2) generating the current report based on both the previous and current images. Moreover, we extract and release a curated temporal benchmark dataset derived from the MIMIC-CXR dataset, which focuses on temporal data. Our comprehensive experiments and ablation studies explore the merits of incorporating prior CXRs and achieve state-of-the-art (SOTA) results on the report generation task. Furthermore, we attain on-par performance with SOTA image generation efforts, thus serving as a new baseline in longitudinal bidirectional CXR-to-report generation. The code is available at https://github.com/BioMedIA-MBZUAI/TiBiX.

Title: Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection

Authors: Xincheng Yao, Ruoqi Li, Zefeng Qian, Lu Wang, Chongyang Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.13349
Pdf URL: https://arxiv.org/pdf/2403.13349
Copy Paste: [[2403.13349]] Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection(https://arxiv.org/abs/2403.13349)
Keywords: anomaly
Abstract: Unified anomaly detection (AD) is one of the most challenging settings in anomaly detection, where one unified model is trained with normal samples from multiple classes with the objective of detecting anomalies in these classes. For such a challenging task, popular normalizing flow (NF) based AD methods may fall into a "homogeneous mapping" issue, where the NF-based AD models are biased to generate similar latent representations for both normal and abnormal features, and thereby lead to a high missing rate of anomalies. In this paper, we propose a novel Hierarchical Gaussian mixture normalizing flow modeling method for accomplishing unified Anomaly Detection, which we call HGAD. Our HGAD consists of two key components: inter-class Gaussian mixture modeling and intra-class mixed class centers learning. Compared to the previous NF-based AD methods, the hierarchical Gaussian mixture modeling approach can bring stronger representation capability to the latent space of normalizing flows, so that even complex multi-class distributions can be well represented and learned in the latent space. In this way, we can avoid mapping different class distributions into the same single Gaussian prior, thus effectively avoiding or mitigating the "homogeneous mapping" issue. We further indicate that the more distinguishable the different class centers are, the easier it is to avoid the bias issue. Thus, we further propose a mutual information maximization loss for better structuring the latent feature space. We evaluate our method on four real-world AD benchmarks, where we can significantly improve on the previous NF-based AD methods and also outperform the SOTA unified AD methods.

Title: AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

Authors: Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Bohua Chen, Yemin Shi, Chengwei Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13352
Pdf URL: https://arxiv.org/pdf/2403.13352
Copy Paste: [[2403.13352]] AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation(https://arxiv.org/abs/2403.13352)
Keywords: diffusion
Abstract: Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in prompt-following ability, image quality, and the lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.

Title: IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis

Authors: Feng Liu, Xiaobin-Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13378
Pdf URL: https://arxiv.org/pdf/2403.13378
Copy Paste: [[2403.13378]] IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis(https://arxiv.org/abs/2403.13378)
Keywords: diffusion, generative
Abstract: Semantic image synthesis aims to generate high-quality images given semantic conditions, i.e. segmentation masks and style reference images. Existing methods widely adopt generative adversarial networks (GANs). GANs take all conditional inputs and directly synthesize images in a single forward step. In this paper, semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model (IIDM). Specifically, the style reference is first contaminated with random noise and then progressively denoised by IIDM, guided by segmentation masks. Moreover, three techniques, refinement, color-transfer and model ensembles, are proposed to further boost the generation quality. They are plug-in inference modules and do not require additional training. Extensive experiments show that our IIDM outperforms existing state-of-the-art methods by clear margins. Further analysis is provided via detailed demonstrations. We have implemented IIDM based on the Jittor framework; code is available at https://github.com/ader47/jittor-jieke-semantic_images_synthesis.

Title: S2DM: Sector-Shaped Diffusion Models for Video Generation

Authors: Haoran Lang, Yuxuan Ge, Zheng Tian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.13408
Pdf URL: https://arxiv.org/pdf/2403.13408
Copy Paste: [[2403.13408]] S2DM: Sector-Shaped Diffusion Models for Video Generation(https://arxiv.org/abs/2403.13408)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying on temporal features with appropriate guided conditions. We apply S2DM to video generation tasks, and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works. Our results can be viewed at https://s2dm.github.io/S2DM/.

Title: MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining

Authors: Di Wang, Jing Zhang, Minqiang Xu, Lin Liu, Dongsheng Wang, Erzhong Gao, Chengxi Han, Haonan Guo, Bo Du, Dacheng Tao, Liangpei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13430
Pdf URL: https://arxiv.org/pdf/2403.13430
Copy Paste: [[2403.13430]] MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining(https://arxiv.org/abs/2403.13430)
Keywords: self-supervised, foundation model
Abstract: Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing supervised and self-supervised learning methods to initialize model weights effectively. However, transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared encoder and task-specific decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of similar size and their competitive performance compared to larger state-of-the-art models, thus validating the effectiveness of MTP.

Title: Progressive trajectory matching for medical dataset distillation

Authors: Zhen Yu, Yang Liu, Qingchao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13469
Pdf URL: https://arxiv.org/pdf/2403.13469
Copy Paste: [[2403.13469]] Progressive trajectory matching for medical dataset distillation(https://arxiv.org/abs/2403.13469)
Keywords: foundation model
Abstract: It is essential but challenging to share medical image datasets due to privacy issues, which prohibit building foundation models and knowledge transfer. In this paper, we propose a novel dataset distillation method to condense the original medical image datasets into a synthetic one that preserves useful information for building an analysis model without accessing the original datasets. Existing methods tackle only natural images by randomly matching parts of the training trajectories of the model parameters trained by the whole real datasets. However, extensive experiments on medical image datasets show that the training process is extremely unstable and achieves inferior distillation results. To overcome these barriers, we propose a novel progressive trajectory matching strategy to improve the training stability for medical image dataset distillation. Additionally, it is observed that improved stability hinders synthetic dataset diversity and final performance improvements. Therefore, we propose a dynamic overlap mitigation module that improves the synthetic dataset diversity by dynamically eliminating the overlap across different images and retraining parts of the synthetic images for better convergence. Finally, we propose a new medical image dataset distillation benchmark of various modalities and configurations to promote fair evaluations. It is validated that our proposed method achieves 8.33% improvement over previous state-of-the-art methods on average, and 11.7% improvement when ipc=2 (i.e., images per class is 2). Code and benchmarks will be released.

Title: Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion

Authors: Lucas Nunes, Rodrigo Marcuzzi, Benedikt Mersch, Jens Behley, Cyrill Stachniss
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13470
Pdf URL: https://arxiv.org/pdf/2403.13470
Copy Paste: [[2403.13470]] Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion(https://arxiv.org/abs/2403.13470)
Keywords: diffusion, generative
Abstract: Computer vision techniques play a central role in the perception stack of autonomous vehicles. Such methods are employed to perceive the vehicle surroundings given sensor data. 3D LiDAR sensors are commonly used to collect sparse 3D point clouds from the scene. However, compared to human perception, such systems struggle to deduce the unseen parts of the scene given those sparse point clouds. In this context, the scene completion task aims at predicting the gaps in the LiDAR measurements to achieve a more complete scene representation. Given the promising results of recent diffusion models as generative models for images, we propose extending them to achieve scene completion from a single 3D LiDAR scan. Previous works used diffusion models over range images extracted from LiDAR data, directly applying image-based diffusion methods. In contrast, we propose to directly operate on the points, reformulating the noising and denoising diffusion process such that it can efficiently work at scene scale. Together with our approach, we propose a regularization loss to stabilize the noise predicted during the denoising process. Our experimental evaluation shows that our method can complete the scene given a single LiDAR scan as input, producing a scene with more details compared to state-of-the-art scene completion methods. We believe that our proposed diffusion process formulation can support further research in diffusion models applied to scene-scale point cloud data.

Title: VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

Authors: Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2403.13501
Pdf URL: https://arxiv.org/pdf/2403.13501
Copy Paste: [[2403.13501]] VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis(https://arxiv.org/abs/2403.13501)
Keywords: diffusion, generative
Abstract: Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.

Title: REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning

Authors: Run He, Huiping Zhuang, Di Fang, Yizhu Chen, Kai Tong, Cen Chen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.13522
Pdf URL: https://arxiv.org/pdf/2403.13522
Copy Paste: [[2403.13522]] REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning(https://arxiv.org/abs/2403.13522)
Keywords: self-supervised
Abstract: Exemplar-free class-incremental learning (EFCIL) aims to mitigate catastrophic forgetting in class-incremental learning without available historical data. Compared with its counterpart (replay-based CIL) that stores historical samples, EFCIL suffers more from forgetting issues under the exemplar-free constraint. In this paper, inspired by the recently developed analytic learning (AL) based CIL, we propose representation enhanced analytic learning (REAL) for EFCIL. REAL constructs a dual-stream base pretraining (DS-BPT) and a representation enhancing distillation (RED) process to enhance the representation of the extractor. The DS-BPT pretrains the model in streams of both supervised learning and self-supervised contrastive learning (SSCL) for base knowledge extraction. The RED process distills the supervised knowledge to the SSCL pretrained backbone and facilitates a subsequent AL-based CIL that converts the CIL to a recursive least-square problem. Our method addresses the issue of insufficient discriminability in representations of unseen data caused by a frozen backbone in the existing AL-based CIL. Empirical results on various datasets including CIFAR-100, ImageNet-100 and ImageNet-1k demonstrate that our REAL outperforms the state of the art in EFCIL, and achieves comparable or even superior performance compared with the replay-based methods.

Title: Compress3D: a Compressed Latent Space for 3D Generation from a Single Image

Authors: Bowen Zhang, Tianyu Yang, Yu Li, Lei Zhang, Xi Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.13524
Pdf URL: https://arxiv.org/pdf/2403.13524
Copy Paste: [[2403.13524]] Compress3D: a Compressed Latent Space for 3D Generation from a Single Image(https://arxiv.org/abs/2403.13524)
Keywords: diffusion
Abstract: 3D generation has witnessed significant advancements, yet efficiently producing high-quality 3D assets from a single image remains challenging. In this paper, we present a triplane autoencoder, which encodes 3D models into a compact triplane latent space to effectively compress both the 3D geometry and texture information. Within the autoencoder framework, we introduce a 3D-aware cross-attention mechanism, which utilizes low-resolution latent representations to query features from a high-resolution 3D feature volume, thereby enhancing the representation capacity of the latent space. Subsequently, we train a diffusion model on this refined latent space. In contrast to solely relying on image embedding for 3D generation, our proposed method advocates for the simultaneous utilization of both image embedding and shape embedding as conditions. Specifically, the shape embedding is estimated via a diffusion prior model conditioned on the image embedding. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art algorithms, achieving superior performance while requiring less training data and time. Our approach enables the generation of high-quality 3D assets in merely 7 seconds on a single A100 GPU.

Title: IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

Authors: Siying Cui, Jiankang Deng, Jia Guo, Xiang An, Yongle Zhao, Xinyu Wei, Ziyong Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13535
Pdf URL: https://arxiv.org/pdf/2403.13535
Copy Paste: [[2403.13535]] IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models(https://arxiv.org/abs/2403.13535)
Keywords: diffusion
Abstract: Leveraging Stable Diffusion for the generation of personalized portraits has emerged as a powerful and noteworthy tool, enabling users to create high-fidelity, custom character avatars based on their specific prompts. However, existing personalization methods face challenges, including test-time fine-tuning, the requirement of multiple input images, low preservation of identity, and limited diversity in generated outcomes. To overcome these challenges, we introduce IDAdapter, a tuning-free approach that enhances the diversity and identity preservation in personalized image generation from a single face image. IDAdapter integrates a personalized concept into the generation process through a combination of textual and visual injections and a face identity loss. During the training phase, we incorporate mixed features from multiple reference images of a specific identity to enrich identity-related content details, guiding the model to generate images with more diverse styles, expressions, and angles compared to previous works. Extensive evaluations demonstrate the effectiveness of our method, achieving both diversity and identity fidelity in generated images.

Title: Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing

Authors: Hangeol Chang, Jinho Chang, Jong Chul Ye
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13551
Pdf URL: https://arxiv.org/pdf/2403.13551
Copy Paste: [[2403.13551]] Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing(https://arxiv.org/abs/2403.13551)
Keywords: diffusion
Abstract: Despite recent advancements in text-to-image diffusion models facilitating various image editing techniques, complex text prompts often lead to an oversight of some requests due to a bottleneck in processing text information. To tackle this challenge, we present Ground-A-Score, a simple yet powerful model-agnostic image editing method by incorporating grounding during score distillation. This approach ensures a precise reflection of intricate prompt requirements in the editing outcomes, taking into account the prior knowledge of the object locations within the image. Moreover, the selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas while preserving the integrity of the objects in the source image. Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts, ensuring high-quality outcomes that respect the original image attributes.

Title: ReGround: Improving Textual and Spatial Grounding at No Cost

Authors: Yuseung Lee, Minhyuk Sung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13589
Pdf URL: https://arxiv.org/pdf/2403.13589
Copy Paste: [[2403.13589]] ReGround: Improving Textual and Spatial Grounding at No Cost(https://arxiv.org/abs/2403.13589)
Keywords: diffusion
Abstract: When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture, changing from sequential to parallel for gated self-attention and cross-attention. This surprisingly simple yet effective solution does not require any fine-tuning of the network but significantly reduces the trade-off between the two groundings. Our experiments demonstrate significant improvements from the original GLIGEN to the rewired version in the trade-off between textual grounding and spatial grounding.

Title: Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

Authors: Meet Doshi, Raj Dabre, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.13638
Pdf URL: https://arxiv.org/pdf/2403.13638
Copy Paste: [[2403.13638]] Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese(https://arxiv.org/abs/2403.13638)
Keywords: generative
Abstract: In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.

Title: ZoDi: Zero-Shot Domain Adaptation with Diffusion-Based Image Transfer

Authors: Hiroki Azuma, Yusuke Matsui, Atsuto Maki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13652
Pdf URL: https://arxiv.org/pdf/2403.13652
Copy Paste: [[2403.13652]] ZoDi: Zero-Shot Domain Adaptation with Diffusion-Based Image Transfer(https://arxiv.org/abs/2403.13652)
Keywords: diffusion
Abstract: Deep learning models achieve high accuracy in segmentation tasks among others, yet domain shift often degrades the models' performance, which can be critical in real-world scenarios where no target images are available. This paper proposes a zero-shot domain adaptation method based on diffusion models, called ZoDi, which is two-fold by design: zero-shot image transfer and model adaptation. First, we utilize an off-the-shelf diffusion model to synthesize target-like images by transferring the domain of source images to the target domain. In this step, we specifically try to maintain the layout and content by utilising layout-to-image diffusion models with stochastic inversion. Secondly, we train the model using both source images and synthesized images with the original segmentation maps while maximizing the feature similarity of images from the two domains to learn domain-robust representations. Through experiments we show the benefits of ZoDi in the task of image segmentation over state-of-the-art methods. It is also more applicable than existing CLIP-based methods because it assumes no specific backbone or models, and it enables estimating the model's performance without target images by inspecting generated images. Our implementation will be publicly available.

Title: DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

Authors: Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, Jiebo Luo
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.13667
Pdf URL: https://arxiv.org/pdf/2403.13667
Copy Paste: [[2403.13667]] DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance(https://arxiv.org/abs/2403.13667)
Keywords: diffusion
Abstract: Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we uncover that dance camera movement is multifaceted and human-centric, and possesses multiple influencing factors, making dance camera synthesis a more challenging task compared to camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Utilizing these metrics, we conduct extensive experiments on our DCM dataset, providing both quantitative and qualitative evidence showcasing the effectiveness of our DanceCamera3D model. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official.

Title: PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case Documents

Authors: Mitodru Niyogi, Arnab Bhattacharya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13681
Pdf URL: https://arxiv.org/pdf/2403.13681
Copy Paste: [[2403.13681]] PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case Documents(https://arxiv.org/abs/2403.13681)
Keywords: generative
Abstract: In this paper, we present PARAMANU-AYN, a language model based exclusively on case documents of the Supreme Court of India, the Constitution of India, and the Indian Penal Code. The novel Auto Regressive (AR) decoder based model is pretrained from scratch at a context size of 8192. We evaluated our pretrained legal model on perplexity metrics. We also instruction-tuned our pretrained model on a set of 10,763 instructions covering various legal tasks such as legal reasoning, judgement explanation, legal clause generation, legal drafting, legal contract drafting, case summarization, constitutional question-answering, etc. We also used GPT-3.5-Turbo to evaluate the instruction-tuned models' responses to prompts on clarity, relevance, completeness, and legal reasoning metrics on a scale of 10. Our model can be run on CPU and achieved 42.46 tokens/sec CPU inference speed. We found that our models, despite not being pretrained on legal books, various legal contracts, and legal documents, were able to learn the domain knowledge required for drafting various legal contracts and legal clauses, and generalize to draft legal contracts and legal clauses with limited instruction tuning. Hence, we conclude that for a strong domain-specialized generative language model (such as legal), very large amounts of data are not required to develop models from scratch. We believe that this work is the first attempt to make a dedicated generative legal language model from scratch for Indian Supreme Court jurisdiction or in legal NLP overall. We plan to release our Paramanu-Ayn model at https://www.bharatgpts.com.

Title: Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes

Authors: Yifan Chen, Mark Goldstein, Mengjian Hua, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.13724
Pdf URL: https://arxiv.org/pdf/2403.13724
Copy Paste: [[2403.13724]] Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes(https://arxiv.org/abs/2403.13724)
Keywords: diffusion, generative
Abstract: We propose a framework for probabilistic forecasting of dynamical systems based on generative modeling. Given observations of the system state over time, we formulate the forecasting problem as sampling from the conditional distribution of the future system state given its current state. To this end, we leverage the framework of stochastic interpolants, which facilitates the construction of a generative model between an arbitrary base distribution and the target. We design a fictitious, non-physical stochastic dynamics that takes as initial condition the current system state and produces as output a sample from the target conditional distribution in finite time and without bias. This process therefore maps a point mass centered at the current state onto a probabilistic ensemble of forecasts. We prove that the drift coefficient entering the stochastic differential equation (SDE) achieving this task is non-singular, and that it can be learned efficiently by square loss regression over the time-series data. We show that the drift and the diffusion coefficients of this SDE can be adjusted after training, and that a specific choice that minimizes the impact of the estimation error gives a Föllmer process. We highlight the utility of our approach on several complex, high-dimensional forecasting problems, including stochastically forced Navier-Stokes and video prediction on the KTH and CLEVRER datasets.

Title: Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Authors: Fu-Yun Wang, Xiaoshi Wu, Zhaoyang Huang, Xiaoyu Shi, Dazhong Shen, Guanglu Song, Yu Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13745
Pdf URL: https://arxiv.org/pdf/2403.13745
Copy Paste: [[2403.13745]] Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation(https://arxiv.org/abs/2403.13745)
Keywords: diffusion, generative
Abstract: Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, and bridges the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies including spatial-aware insertion and noise travel are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning.

Title: Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model

Authors: Diwei Wang, Kun Yuan, Candice Muller, Frédéric Blanc, Nicolas Padoy, Hyewon Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13756
Pdf URL: https://arxiv.org/pdf/2403.13756
Copy Paste: [[2403.13756]] Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model(https://arxiv.org/abs/2403.13756)
Keywords: generative
Abstract: We present a knowledge augmentation strategy for assessing the diagnostic groups and gait impairment from monocular gait videos. Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos through collective learning across three distinct modalities: gait videos, class-specific descriptions, and numerical gait parameters. Our specific contributions are two-fold: First, we adopt a knowledge-aware prompt tuning strategy to utilize the class-specific medical description in guiding the text prompt learning. Second, we integrate the paired gait parameters in the form of numerical texts to enhance the numeracy of the textual representation. Results demonstrate that our model not only significantly outperforms the state of the art (SOTA) in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions using the vocabulary of quantitative gait parameters. The code and the model will be made available at our project page.

Title: The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI

Authors: Matt White, Ibrahim Haddad, Cailean Osborne, Xiao-Yang (Yanglet) Liu, Ahmed Abdelmonsef, Sachin Varghese
Subjects: cs.LG, cs.AI, cs.CY, cs.SE
Abstract URL: https://arxiv.org/abs/2403.13784
Pdf URL: https://arxiv.org/pdf/2403.13784
Copy Paste: [[2403.13784]] The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI(https://arxiv.org/abs/2403.13784)
Keywords: generative
Abstract: Generative AI (GAI) offers unprecedented possibilities but its commercialization has raised concerns about transparency, reproducibility, bias, and safety. Many "open-source" GAI models lack the necessary components for full understanding and reproduction, and some use restrictive licenses, a practice known as "openwashing." We propose the Model Openness Framework (MOF), a ranked classification system that rates machine learning models based on their completeness and openness, following principles of open science, open source, open data, and open access. The MOF requires specific components of the model development lifecycle to be included and released under appropriate open licenses. This framework aims to prevent misrepresentation of models claiming to be open, guide researchers and developers in providing all model components under permissive licenses, and help companies, academia, and hobbyists identify models that can be safely adopted without restrictions. Wide adoption of the MOF will foster a more open AI ecosystem, accelerating research, innovation, and adoption.

Title: DepthFM: Fast Monocular Depth Estimation with Flow Matching

Authors: Ming Gui, Johannes S. Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13788
Pdf URL: https://arxiv.org/pdf/2403.13788
Copy Paste: [[2403.13788]] DepthFM: Fast Monocular Depth Estimation with Flow Matching(https://arxiv.org/abs/2403.13788)
Keywords: diffusion, generative
Abstract: Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited due to blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data to generalize to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at a favorably low computational cost, despite being trained on only a small amount of synthetic data.

Title: TimeRewind: Rewinding Time with Image-and-Events Video Diffusion

Authors: Jingxi Chen, Brandon Y. Feng, Haoming Cai, Mingyang Xie, Christopher Metzler, Cornelia Fermuller, Yiannis Aloimonos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.13800
Pdf URL: https://arxiv.org/pdf/2403.13800
Copy Paste: [[2403.13800]] TimeRewind: Rewinding Time with Image-and-Events Video Diffusion(https://arxiv.org/abs/2403.13800)
Keywords: diffusion, generative
Abstract: This paper addresses the novel challenge of "rewinding" time from a single captured image to recover the fleeting moments missed just before the shutter button is pressed. This problem poses a significant challenge in computer vision and computational photography, as it requires predicting plausible pre-capture motion from a single static frame, an inherently ill-posed task due to the high degree of freedom in potential pixel movements. We overcome this challenge by leveraging the emerging technology of neuromorphic event cameras, which capture motion information with high temporal resolution, and integrating this data with advanced image-to-video diffusion models. Our proposed framework introduces an event motion adaptor conditioned on event camera data, guiding the diffusion model to generate videos that are visually coherent and physically grounded in the captured events. Through extensive experimentation, we demonstrate the capability of our approach to synthesize high-quality videos that effectively "rewind" time, showcasing the potential of combining event camera technology with generative models. Our work opens new avenues for research at the intersection of computer vision, computational photography, and generative modeling, offering a forward-thinking solution to capturing missed moments and enhancing future consumer cameras and smartphones. Please see the project page at https://timerewind.github.io/ for video results and code release.

Title: ZigMa: Zigzag Mamba Diffusion Model

Authors: Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, Bjorn Ommer
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13802
Pdf URL: https://arxiv.org/pdf/2403.13802
Copy Paste: [[2403.13802]] ZigMa: Zigzag Mamba Diffusion Model(https://arxiv.org/abs/2403.13802)
Keywords: diffusion
Abstract: The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/

Title: Editing Massive Concepts in Text-to-Image Diffusion Models

Authors: Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13807
Pdf URL: https://arxiv.org/pdf/2403.13807
Copy Paste: [[2403.13807]] Editing Massive Concepts in Text-to-Image Diffusion Models(https://arxiv.org/abs/2403.13807)
Keywords: diffusion
Abstract: Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed-form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications.

Title: On Pretraining Data Diversity for Self-Supervised Learning

Authors: Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.13808
Pdf URL: https://arxiv.org/pdf/2403.13808
Copy Paste: [[2403.13808]] On Pretraining Data Diversity for Self-Supervised Learning(https://arxiv.org/abs/2403.13808)
Keywords: diffusion, self-supervised
Abstract: We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models will be available at https://github.com/hammoudhasan/DiversitySSL .