2024-09-27

Title: An Art-centric perspective on AI-based content moderation of nudity

Authors: Piera Riccio, Georgina Curto, Thomas Hofmann, Nuria Oliver
Subjects: cs.CV, cs.SI
Abstract URL: https://arxiv.org/abs/2409.17156
Pdf URL: https://arxiv.org/pdf/2409.17156
Copy Paste: [[2409.17156]] An Art-centric perspective on AI-based content moderation of nudity(https://arxiv.org/abs/2409.17156)
Keywords: generative
Abstract: At a time when the influence of generative Artificial Intelligence on visual arts is a highly debated topic, we draw attention to a more subtle phenomenon: the algorithmic censorship of artistic nudity online. We analyze the performance of three "Not-Safe-For-Work" image classifiers on artistic nudity, and empirically uncover the existence of a gender and a stylistic bias, as well as evident technical limitations, especially when only considering visual information. Hence, we propose a multi-modal zero-shot classification approach that improves artistic nudity classification. From our research, we draw several implications that we hope will inform future research on this topic.

Title: Enhancing Guardrails for Safe and Secure Healthcare AI

Authors: Ananya Gangavarapu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2409.17190
Pdf URL: https://arxiv.org/pdf/2409.17190
Copy Paste: [[2409.17190]] Enhancing Guardrails for Safe and Secure Healthcare AI(https://arxiv.org/abs/2409.17190)
Keywords: generative
Abstract: Generative AI holds immense promise in addressing global healthcare access challenges, with numerous innovative applications now ready for use across various healthcare domains. However, a significant barrier to the widespread adoption of these domain-specific AI solutions is the lack of robust safety mechanisms to effectively manage issues such as hallucination, misinformation, and ensuring truthfulness. Left unchecked, these risks can compromise patient safety and erode trust in healthcare AI systems. While general-purpose frameworks like Llama Guard are useful for filtering toxicity and harmful content, they do not fully address the stringent requirements for truthfulness and safety in healthcare contexts. This paper examines the unique safety and security challenges inherent to healthcare AI, particularly the risk of hallucinations, the spread of misinformation, and the need for factual accuracy in clinical settings. I propose enhancements to existing guardrails frameworks, such as Nvidia NeMo Guardrails, to better suit healthcare-specific needs. By strengthening these safeguards, I aim to ensure the secure, reliable, and accurate use of AI in healthcare, mitigating misinformation risks and improving patient safety.

Title: A random measure approach to reinforcement learning in continuous time

Authors: Christian Bender, Nguyen Tran Thuan
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2409.17200
Pdf URL: https://arxiv.org/pdf/2409.17200
Copy Paste: [[2409.17200]] A random measure approach to reinforcement learning in continuous time(https://arxiv.org/abs/2409.17200)
Keywords: diffusion
Abstract: We present a random measure approach for modeling exploration, i.e., the execution of measure-valued controls, in continuous-time reinforcement learning (RL) with controlled diffusion and jumps. First, we consider the case when sampling the randomized control in continuous time takes place on a discrete-time grid and reformulate the resulting stochastic differential equation (SDE) as an equation driven by suitable random measures. The construction of these random measures makes use of the Brownian motion and the Poisson random measure (which are the sources of noise in the original model dynamics) as well as the additional random variables, which are sampled on the grid for the control execution. Then, we prove a limit theorem for these random measures as the mesh-size of the sampling grid goes to zero, which leads to the grid-sampling limit SDE that is jointly driven by white noise random measures and a Poisson random measure. We also argue that the grid-sampling limit SDE can substitute the exploratory SDE and the sample SDE of the recent continuous-time RL literature, i.e., it can be applied for the theoretical analysis of exploratory control problems and for the derivation of learning algorithms.

Title: 2024 BRAVO Challenge Track 1 1st Place Report: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation

Authors: Tommie Kerssies, Daan de Geus, Gijs Dubbelman
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2409.17208
Pdf URL: https://arxiv.org/pdf/2409.17208
Copy Paste: [[2409.17208]] 2024 BRAVO Challenge Track 1 1st Place Report: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation(https://arxiv.org/abs/2409.17208)
Keywords: foundation model
Abstract: In this report, we present our solution for Track 1 of the 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches, and achieves 1st place in the challenge. Our code is publicly available at this https URL.

Title: Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Authors: Mattia Segu, Luigi Piccinelli, Siyuan Li, Luc Van Gool, Fisher Yu, Bernt Schiele
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17221
Pdf URL: https://arxiv.org/pdf/2409.17221
Copy Paste: [[2409.17221]] Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs(https://arxiv.org/abs/2409.17221)
Keywords: self-supervised
Abstract: The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.

Title: Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Authors: Hui En Pang, Shuai Liu, Zhongang Cai, Lei Yang, Tianwei Zhang, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17280
Pdf URL: https://arxiv.org/pdf/2409.17280
Copy Paste: [[2409.17280]] Disco4D: Disentangled 4D Human Generation and Animation from a Single Image(https://arxiv.org/abs/2409.17280)
Keywords: diffusion
Abstract: We present Disco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. Different from existing methods, Disco4D distinctively disentangles clothing (with Gaussian models) from the human body (with SMPL-X model), significantly enhancing the generation details and flexibility. It has the following technical innovations. 1) Disco4D learns to efficiently fit the clothing Gaussians over the SMPL-X Gaussians. 2) It adopts diffusion models to enhance the 3D generation process, e.g., modeling occluded parts not visible in the input image. 3) It learns an identity encoding for each clothing Gaussian to facilitate the separation and extraction of clothing assets. Furthermore, Disco4D naturally supports 4D human animation with vivid dynamics. Extensive experiments demonstrate the superiority of Disco4D on 4D human generation and animation tasks. Our visualizations can be found at this https URL.

Title: Consistent estimation of generative model representations in the data kernel perspective space

Authors: Aranyak Acharyya, Michael W. Trosset, Carey E. Priebe, Hayden S. Helm
Subjects: cs.LG, math.ST
Abstract URL: https://arxiv.org/abs/2409.17308
Pdf URL: https://arxiv.org/pdf/2409.17308
Copy Paste: [[2409.17308]] Consistent estimation of generative model representations in the data kernel perspective space(https://arxiv.org/abs/2409.17308)
Keywords: diffusion, generative
Abstract: Generative models, such as large language models and text-to-image diffusion models, produce relevant information when presented with a query. Different models may produce different information when presented with the same query. As the landscape of generative models evolves, it is important to develop techniques to study and analyze differences in model behaviour. In this paper we present novel theoretical results for embedding-based representations of generative models in the context of a set of queries. We establish sufficient conditions for the consistent estimation of the model embeddings in situations where the query set and the number of models grow.

Title: KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation

Authors: Anantaa Kotal, Anupam Joshi
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2409.17315
Pdf URL: https://arxiv.org/pdf/2409.17315
Copy Paste: [[2409.17315]] KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation(https://arxiv.org/abs/2409.17315)
Keywords: generative
Abstract: The integration of privacy measures, including differential privacy techniques, ensures a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model's capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.

Title: VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection

Authors: Liangyu Zhong, Joachim Sicking, Fabian Hüger, Hanno Gottschalk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17330
Pdf URL: https://arxiv.org/pdf/2409.17330
Copy Paste: [[2409.17330]] VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection(https://arxiv.org/abs/2409.17330)
Keywords: anomaly
Abstract: Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine-tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. Additionally, we propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model, which includes max-logit prompt ensembling and a class-merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision-language models for pixel-wise anomaly detection.
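
As a rough illustration of the max-logit idea named in the abstract above, the sketch below scores each pixel by the negative of its largest class logit, so pixels that no known class explains well get a high anomaly score. This is the generic max-logit baseline, not VL4AD's scoring function, prompt ensembling, or class merging; the function name and the toy logits are assumptions made purely for illustration.

```python
def max_logit_anomaly_scores(logits):
    """Return one anomaly score per pixel.

    logits: list of per-pixel lists, one logit per known class.
    A low maximum logit means no known class fits well, so the maximum
    is negated to make "more anomalous" correspond to a higher score.
    """
    return [-max(pixel_logits) for pixel_logits in logits]


# Toy example with made-up numbers: two confident pixels, one uncertain.
toy_logits = [
    [9.0, 1.0, 0.5],  # clearly class 0
    [0.2, 7.5, 1.0],  # clearly class 1
    [0.9, 1.1, 1.0],  # no class stands out
]
scores = max_logit_anomaly_scores(toy_logits)

# Index of the pixel with the highest anomaly score.
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

In this toy input the third pixel is flagged, because its largest logit is the smallest of the three.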

Title: Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting

Authors: Jay Zoellin, Colin Merk, Mischa Buob, Amr Saad, Samuel Giesser, Tahm Spitznagel, Ferhat Turgut, Rui Santos, Yukun Zhou, Sigfried Wagner, Pearse A. Keane, Yih Chung Tham, Delia Cabrera DeBuc, Matthias D. Becker, Gabor M. Somfai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.17332
Pdf URL: https://arxiv.org/pdf/2409.17332
Copy Paste: [[2409.17332]] Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting(https://arxiv.org/abs/2409.17332)
Keywords: self-supervised, foundation model
Abstract: Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked to RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data-efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.

Title: Trading through Earnings Seasons using Self-Supervised Contrastive Representation Learning

Authors: Zhengxin Joseph Ye, Bjoern Schuller
Subjects: cs.LG, q-fin.TR
Abstract URL: https://arxiv.org/abs/2409.17392
Pdf URL: https://arxiv.org/pdf/2409.17392
Copy Paste: [[2409.17392]] Trading through Earnings Seasons using Self-Supervised Contrastive Representation Learning(https://arxiv.org/abs/2409.17392)
Keywords: self-supervised
Abstract: Earnings release is a key economic event in the financial markets and crucial for predicting stock movements. Earnings data gives a glimpse into how a company is doing financially and can hint at where its stock might go next. However, the irregularity of its release cycle makes it a challenge to incorporate this data in a medium-frequency algorithmic trading model and the usefulness of this data fades fast after it is released, making it tough for models to stay accurate over time. Addressing this challenge, we introduce the Contrastive Earnings Transformer (CET) model, a self-supervised learning approach rooted in Contrastive Predictive Coding (CPC), aiming to optimise the utilisation of earnings data. To ascertain its effectiveness, we conduct a comparative study of CET against benchmark models across diverse sectors. Our research delves deep into the intricacies of stock data, evaluating how various models, and notably CET, handle the rapidly changing relevance of earnings data over time and over different sectors. The research outcomes shed light on CET's distinct advantage in extrapolating the inherent value of earnings data over time. Its foundation on CPC allows for a nuanced understanding, facilitating consistent stock predictions even as the earnings data ages. This finding about CET presents a fresh approach to better use earnings data in algorithmic trading for predicting stock price trends.

Title: Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

Authors: Chirag Vashist, Shichong Peng, Ke Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.17439
Pdf URL: https://arxiv.org/pdf/2409.17439
Copy Paste: [[2409.17439]] Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis(https://arxiv.org/abs/2409.17439)
Keywords: diffusion, generative
Abstract: An emerging area of research aims to learn deep generative models with limited training data. Prior generative models like GANs and diffusion models require a lot of data to perform well, and their performance degrades when they are trained on only a small amount of data. A recent technique called Implicit Maximum Likelihood Estimation (IMLE) has been adapted to the few-shot setting, achieving state-of-the-art performance. However, current IMLE-based approaches encounter challenges due to inadequate correspondence between the latent codes selected for training and those drawn during inference. This results in suboptimal test-time performance. We theoretically show a way to address this issue and propose RS-IMLE, a novel approach that changes the prior distribution used for training. This leads to substantially higher quality image generation compared to existing GAN and IMLE-based methods, as validated by comprehensive experiments conducted on nine few-shot image datasets.

Title: CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

Authors: Sifan Wu, Amir Khasahmadi, Mor Katz, Pradeep Kumar Jayaraman, Yewen Pu, Karl Willis, Bang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.17457
Pdf URL: https://arxiv.org/pdf/2409.17457
Copy Paste: [[2409.17457]] CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches(https://arxiv.org/abs/2409.17457)
Keywords: foundation model, generative
Abstract: Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.

Title: Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection

Authors: Yi Gu, Yi Lin, Kwang-Ting Cheng, Hao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17485
Pdf URL: https://arxiv.org/pdf/2409.17485
Copy Paste: [[2409.17485]] Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection(https://arxiv.org/abs/2409.17485)
Keywords: anomaly
Abstract: Medical anomaly detection (AD) is crucial in pathological identification and localization. Current methods typically rely on uncertainty estimation in deep ensembles to detect anomalies, assuming that ensemble learners should agree on normal samples while exhibiting disagreement on unseen anomalies in the output space. However, these methods may suffer from inadequate disagreement on anomalies or diminished agreement on normal samples. To tackle these issues, we propose D2UE, a Diversified Dual-space Uncertainty Estimation framework for medical anomaly detection. To effectively balance agreement and disagreement for anomaly detection, we propose Redundancy-Aware Repulsion (RAR), which uses a similarity kernel that remains invariant to both isotropic scaling and orthogonal transformations, explicitly promoting diversity in learners' feature space. Moreover, to accentuate anomalous regions, we develop Dual-Space Uncertainty (DSU), which utilizes the ensemble's uncertainty in input and output spaces. In input space, we first calculate gradients of reconstruction error with respect to input images. The gradients are then integrated with reconstruction outputs to estimate uncertainty for inputs, enabling effective anomaly discrimination even when output space disagreement is minimal. We conduct a comprehensive evaluation on five medical benchmarks with different backbones. Experimental results demonstrate the superiority of our method over state-of-the-art methods and the effectiveness of each component in our framework. Our code is available at this https URL.

Title: Learning Quantized Adaptive Conditions for Diffusion Models

Authors: Yuchen Liang, Yuchuan Tian, Lei Yu, Huao Tang, Jie Hu, Xiangzhong Fang, Hanting Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17487
Pdf URL: https://arxiv.org/pdf/2409.17487
Copy Paste: [[2409.17487]] Learning Quantized Adaptive Conditions for Diffusion Models(https://arxiv.org/abs/2409.17487)
Keywords: diffusion
Abstract: The curvature of ODE trajectories in diffusion models hinders their ability to generate high-quality images in a small number of function evaluations (NFE). In this paper, we propose a novel and effective approach to reduce trajectory curvature by utilizing adaptive conditions. By employing an extremely lightweight quantized encoder, our method incurs only an additional 1% of training parameters, eliminates the need for extra regularization terms, yet achieves significantly better sample quality. Our approach accelerates ODE sampling while preserving the downstream task image editing capabilities of SDE techniques. Extensive experiments verify that our method can generate high-quality results under extremely limited sampling costs. With only 6 NFE, we achieve 5.14 FID on CIFAR-10, 6.91 FID on FFHQ 64x64 and 3.10 FID on AFHQv2.

Title: Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

Authors: Xun Zhu, Ying Hu, Fanbin Mo, Miao Li, Ji Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.17508
Pdf URL: https://arxiv.org/pdf/2409.17508
Copy Paste: [[2409.17508]] Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE(https://arxiv.org/abs/2409.17508)
Keywords: foundation model
Abstract: Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves an efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with average performance gains of up to 8%. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code, data and model will soon be available on GitHub.

Title: JoyType: A Robust Design for Multilingual Visual Text Creation

Authors: Chao Li, Chen Jiang, Xiaolong Liu, Jun Zhao, Guoxin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17524
Pdf URL: https://arxiv.org/pdf/2409.17524
Copy Paste: [[2409.17524]] JoyType: A Robust Design for Multilingual Visual Text Creation(https://arxiv.org/abs/2409.17524)
Keywords: diffusion
Abstract: Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million pairs of data. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model's ability to maintain font style, notably in generating small-font text, we incorporated a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other stable diffusion models on HuggingFace and CivitAI. Our project is open-sourced on this https URL.

Title: A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Subjects: cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.17550
Pdf URL: https://arxiv.org/pdf/2409.17550
Copy Paste: [[2409.17550]] A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation(https://arxiv.org/abs/2409.17550)
Keywords: diffusion
Abstract: In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which provides different timestep information to each base model. It is designed to align how samples are generated along with timesteps across modalities. The second one is a new design of the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represents temporal position information, and the embeddings are fed into the model like positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and also demonstrate that our method outperforms existing methods.

Title: Pixel-Space Post-Training of Latent Diffusion Models

Authors: Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, Jialiang Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.17565
Pdf URL: https://arxiv.org/pdf/2409.17565
Copy Paste: [[2409.17565]] Pixel-Space Post-Training of Latent Diffusion Models(https://arxiv.org/abs/2409.17565)
Keywords: diffusion
Abstract: Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs is done in latent space, which is typically $8 \times 8$ lower spatial-resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective improves both supervised quality fine-tuning and preference-based post-training by a large margin on a state-of-the-art DiT transformer and U-Net diffusion models in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
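
To make the resolution gap in the abstract above concrete, the following sketch computes the latent grid size for a given image size. It assumes that "8 × 8 lower spatial-resolution" means a downsampling factor of 8 along each side, so that one latent position covers 64 output pixels; the function names and the 1024 × 1024 example are illustrative assumptions, not details from the paper.

```python
def latent_shape(height, width, factor=8):
    """Spatial size of the latent grid for an image of the given size.

    factor is the per-side downsampling factor (assumed to be 8 here).
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the factor")
    return height // factor, width // factor


def pixels_per_latent_cell(factor=8):
    """Number of output pixels covered by one latent position."""
    return factor * factor


# Hypothetical 1024 x 1024 output image.
h, w = latent_shape(1024, 1024)
```

Under these assumptions a 1024 × 1024 image is represented by a 128 × 128 latent grid, which is one way to see why fine detail may be hard to supervise in latent space alone.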

Title: Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

Authors: Hongtao Huang, Xiaojun Chang, Lina Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17566
Pdf URL: https://arxiv.org/pdf/2409.17566
Copy Paste: [[2409.17566]] Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule(https://arxiv.org/abs/2409.17566)
Keywords: diffusion, generative
Abstract: Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images. Despite their effectiveness, these models often require significant computational resources owing to their numerous sequential denoising steps and the significant inference cost of each step. Recently, Neural Architecture Search (NAS) techniques have been employed to automatically search for faster generation processes. However, NAS for diffusion is inherently time-consuming as it requires estimating thousands of diffusion models to search for the optimal one. In this paper, we introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models by concurrently optimizing generation steps and network structures. Specifically, we partition the generation process into isometric step segments, each sequentially composed of a full step, multiple partial steps, and several null steps. The full step computes all network blocks, while the partial step involves part of the blocks, and the null step entails no computation. Flexiffusion autonomously explores flexible step combinations for each segment, substantially reducing search costs and enabling greater acceleration compared to the state-of-the-art (SOTA) method for diffusion models. Our searched models reported speedup factors of $2.6\times$ and $1.5\times$ for the original LDM-4-G and the SOTA, respectively. The factors for Stable Diffusion V1.5 and the SOTA are $5.1\times$ and $2.0\times$. We also verified the performance of Flexiffusion on multiple datasets, and positive experiment results indicate that Flexiffusion can effectively reduce redundancy in diffusion models.
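The segment structure described in the abstract above (equal-length segments, each made of one full step, then partial steps, then null steps) can be sketched as follows. This is only an illustration of the schedule layout, not the authors' search procedure; the function names, the per-segment `(n_partial, n_null)` encoding, and the assumed cost of a partial step are all hypothetical.

```python
def build_segment(length, n_partial, n_null):
    """One segment: a full step, then partial steps, then null steps."""
    if 1 + n_partial + n_null != length:
        raise ValueError("segment must contain exactly `length` steps")
    return ["full"] + ["partial"] * n_partial + ["null"] * n_null


def build_schedule(total_steps, segment_length, choices):
    """Concatenate equal-length segments.

    `choices` holds one (n_partial, n_null) pair per segment (hypothetical encoding).
    """
    if total_steps % segment_length:
        raise ValueError("total_steps must be a multiple of segment_length")
    if len(choices) != total_steps // segment_length:
        raise ValueError("need one choice per segment")
    schedule = []
    for n_partial, n_null in choices:
        schedule.extend(build_segment(segment_length, n_partial, n_null))
    return schedule


def relative_cost(schedule, partial_fraction=0.5):
    """Compute cost relative to running a full step at every position.

    `partial_fraction` is an assumed cost of a partial step, not a value from the paper.
    """
    cost = {"full": 1.0, "partial": partial_fraction, "null": 0.0}
    return sum(cost[step] for step in schedule) / len(schedule)


if __name__ == "__main__":
    schedule = build_schedule(8, 4, [(2, 1), (1, 2)])
    print(schedule)
    print(relative_cost(schedule))
```

A search over such schedules would then vary the per-segment pairs and keep the combination with the best quality-to-cost trade-off.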

Title: ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition

Authors: Shen Li, Jianqing Xu, Jiaying Wu, Miao Xiong, Ailin Deng, Jiazhen Ji, Yuge Huang, Wenjie Feng, Shouhong Ding, Bryan Hooi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17576
Pdf URL: https://arxiv.org/pdf/2409.17576
Copy Paste: [[2409.17576]] ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition(https://arxiv.org/abs/2409.17576)
Keywords: diffusion
Abstract: Synthetic face recognition (SFR) aims to generate synthetic face datasets that mimic the distribution of real face data, which allows for training face recognition models in a privacy-preserving manner. Despite the remarkable potential of diffusion models in image generation, current diffusion-based SFR models struggle with generalization to real-world faces. To address this limitation, we outline three key objectives for SFR: (1) promoting diversity across identities (inter-class diversity), (2) ensuring diversity within each identity by injecting various facial attributes (intra-class diversity), and (3) maintaining identity consistency within each identity group (intra-class identity preservation). Inspired by these goals, we introduce a diffusion-fueled SFR model termed $\text{ID}^3$. $\text{ID}^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances. Theoretically, we show that minimizing this loss is equivalent to maximizing the lower bound of an adjusted conditional log-likelihood over ID-preserving data. This equivalence motivates an ID-preserving sampling algorithm, which operates over an adjusted gradient vector field, enabling the generation of fake face recognition datasets that approximate the distribution of real-world faces. Extensive experiments across five challenging benchmarks validate the advantages of $\text{ID}^3$.

Title: RmGPT: Rotating Machinery Generative Pretrained Model

Authors: Yilin Wang, Yifei Yu, Kong Sun, Peixuan Lei, Yuxuan Zhang, Enrico Zio, Aiguo Xia, Yuanxiang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.17604
Pdf URL: https://arxiv.org/pdf/2409.17604
Copy Paste: [[2409.17604]] RmGPT: Rotating Machinery Generative Pretrained Model(https://arxiv.org/abs/2409.17604)
Keywords: self-supervised, foundation model, generative
Abstract: In industry, the reliability of rotating machinery is critical for production efficiency and safety. Current methods of Prognostics and Health Management (PHM) often rely on task-specific models, which face significant challenges in handling diverse datasets with varying signal characteristics, fault modes and operating conditions. Inspired by advancements in generative pretrained models, we propose RmGPT, a unified model for diagnosis and prognosis tasks. RmGPT introduces a novel token-based framework, incorporating Signal Tokens, Prompt Tokens, Time-Frequency Task Tokens and Fault Tokens to handle heterogeneous data within a unified model architecture. We leverage self-supervised learning for robust feature extraction and introduce a next signal token prediction pretraining strategy, alongside efficient prompt learning for task-specific adaptation. Extensive experiments demonstrate that RmGPT significantly outperforms state-of-the-art algorithms, achieving near-perfect accuracy in diagnosis tasks and exceptionally low errors in prognosis tasks. Notably, RmGPT excels in few-shot learning scenarios, achieving 92% accuracy in 16-class one-shot experiments, highlighting its adaptability and robustness. This work establishes RmGPT as a powerful PHM foundation model for rotating machinery, advancing the scalability and generalizability of PHM solutions.

Title: Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

Authors: Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Shuangli Du, Cheng Shi, Zhiyong Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17608
Pdf URL: https://arxiv.org/pdf/2409.17608
Copy Paste: [[2409.17608]] Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection(https://arxiv.org/abs/2409.17608)
Keywords: anomaly
Abstract: Video anomaly detection (VAD) often learns the distribution of normal samples and detects the anomaly through measuring significant deviations, but the undesired generalization may reconstruct a few anomalies, thus suppressing the deviations. Meanwhile, most VADs cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model-tuning from the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve zero-shot cross-dataset validation. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.
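The first step of the abstract above, blurring the raw appearance image to construct a pseudo-anomaly, can be illustrated with a plain separable Gaussian blur. This is a minimal standard-library sketch, not the authors' code: the kernel radius, sigma, and edge handling (clamping to the border) are assumptions, and images are represented as nested lists of floats.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights over offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                source = min(max(x + k - radius, 0), width - 1)
                acc += weight * row[source]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


if __name__ == "__main__":
    frame = [[0.0] * 5 for _ in range(5)]
    frame[2][2] = 1.0  # a single bright pixel
    for row in gaussian_blur(frame):
        print([round(value, 3) for value in row])
```

The blurred frame would play the role of the network input, with the original frame as the deblurring target.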

Title: ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Authors: Zhangpu Li, Changhong Zou, Suxue Ma, Zhicheng Yang, Chen Du, Youbao Tang, Zhenjie Cao, Ning Zhang, Jui-Hsin Lai, Ruei-Sung Lin, Yuan Ni, Xingzhi Sun, Jing Xiao, Kai Zhang, Mei Han
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2409.17610
Pdf URL: https://arxiv.org/pdf/2409.17610
Copy Paste: [[2409.17610]] ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue(https://arxiv.org/abs/2409.17610)
Keywords: in-context
Abstract: The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.

Title: Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation

Authors: Huan Yang, Jiahui Chen, Chaofan Ding, Runhua Shi, Siyu Xiong, Qingqi Hong, Xiaoqi Mo, Xinhan Di
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17674
Pdf URL: https://arxiv.org/pdf/2409.17674
Copy Paste: [[2409.17674]] Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation(https://arxiv.org/abs/2409.17674)
Keywords: diffusion, self-supervised
Abstract: Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation of gestures in co-speech, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model which incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate the generation of hand gestures, which are crucial for generating realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% in FGD, DIV, and FVD, 8.1% in PSNR, and 2.5% in SSIM over the current state-of-the-art methods.

Title: Dark Miner: Defend against unsafe generation for text-to-image diffusion models

Authors: Zheling Meng, Bo Peng, Xiaochuan Jin, Yue Jiang, Jing Dong, Wei Wang, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17682
Pdf URL: https://arxiv.org/pdf/2409.17682
Copy Paste: [[2409.17682]] Dark Miner: Defend against unsafe generation for text-to-image diffusion models(https://arxiv.org/abs/2409.17682)
Keywords: diffusion
Abstract: Text-to-image diffusion models have been shown to produce unsafe content, such as violent, sexual, and shocking images, because of unfiltered large-scale training data, necessitating the erasure of unsafe concepts. Most existing methods focus on modifying the generation probabilities conditioned on the texts containing unsafe descriptions. However, they fail to guarantee safe generation for texts unseen in the training phase, especially for prompts from adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probabilities of unsafe generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. It greedily mines embeddings with maximum generation probabilities of unsafe concepts and reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model's native generation capability. Our code will be available on GitHub.

Title: MIO: A Foundation Model on Multimodal Tokens

Authors: Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.17692
Pdf URL: https://arxiv.org/pdf/2409.17692
Copy Paste: [[2409.17692]] MIO: A Foundation Model on Multimodal Tokens(https://arxiv.org/abs/2409.17692)
Keywords: foundation model
Abstract: In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.

Title: AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status

Authors: Jinghao Zhang, Wen Qian, Hao Luo, Fan Wang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17740
Pdf URL: https://arxiv.org/pdf/2409.17740
Copy Paste: [[2409.17740]] AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status(https://arxiv.org/abs/2409.17740)
Keywords: diffusion
Abstract: Diffusion models have made compelling progress on facilitating high-throughput daily production. Nevertheless, appealing customization requirements still suffer from the need for instance-level finetuning to reach authentic fidelity. Prior zero-shot customization works achieve semantic consistency through the condensed injection of identity features, while addressing detailed low-level signatures through complex model configurations and subject-specific fabrications, which significantly break the statistical coherence within the overall system and limit the applicability across various scenarios. To facilitate generic signature concentration with rectified efficiency, we present AnyLogo, a zero-shot region customizer with remarkable detail consistency, built upon a symbiotic diffusion system that eliminates cumbersome designs. Streamlined as vanilla image generation, we discern that rigorous signature extraction and creative content generation are promisingly compatible and can be systematically recycled within a single denoising model. In place of external configurations, the gemini status of the denoising model promotes reinforced subject transmission efficiency and a disentangled semantic-signature space with continuous signature decoration. Moreover, a sparse recycling paradigm is adopted to prevent the risk of duplication, with a compressed transmission quota for diversified signature stimulation. Extensive experiments on constructed logo-level benchmarks demonstrate the effectiveness and practicability of our methods.

Title: Text Image Generation for Low-Resource Languages with Dual Translation Learning

Authors: Chihiro Noguchi, Shun Fukuda, Shoichiro Mihara, Masao Yamanaka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17747
Pdf URL: https://arxiv.org/pdf/2409.17747
Copy Paste: [[2409.17747]] Text Image Generation for Low-Resource Languages with Dual Translation Learning(https://arxiv.org/abs/2409.17747)
Keywords: diffusion
Abstract: Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: "synthetic" and "real." The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model's explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.

Title: Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs

Authors: Qinpeng Cui, Yixuan Liu, Xinyi Zhang, Qiqi Bao, Zhongdao Wang, Qingmin Liao, Li Wang, Tian Lu, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17778
Pdf URL: https://arxiv.org/pdf/2409.17778
Copy Paste: [[2409.17778]] Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs(https://arxiv.org/abs/2409.17778)
Keywords: diffusion, generative
Abstract: Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed DoS-SDEs. This advancement leads to fast and customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion-prior-based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency. Code: this https URL.

Title: Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Authors: Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.17791
Pdf URL: https://arxiv.org/pdf/2409.17791
Copy Paste: [[2409.17791]] Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness(https://arxiv.org/abs/2409.17791)
Keywords: self-supervised
Abstract: Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred responses, respectively. However, while this training strategy omits the reward model, it also overlooks the varying preference degrees within different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets of different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boost their performance to achieve state-of-the-art performance. We also conduct detailed analyses to offer comprehensive insights into SPO, which verifies its effectiveness. The code is available at this https URL.
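For reference, the binary cross-entropy objective on pairwise samples that the abstract above attributes to DPO can be written for a single preference pair as below. This sketch covers only the standard DPO loss; the preference degree loss that SPO adds is not specified in the abstract and is not reproduced here. The inputs are assumed to be summed sequence log-probabilities, and `beta=0.1` is an arbitrary illustrative value.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a log-probability of a full response under the policy
    or the frozen reference model.
    """
    # Implicit reward margin between the preferred and dis-preferred response.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy equal to the reference: zero margin.
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy prefers the chosen response more than the reference does.
    print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

The loss depends only on the sign and size of the margin, which is the point the abstract makes: a pair with a slight preference and a pair with a strong preference are treated through the same binary objective.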

Title: Continual learning with task specialist

Authors: Indu Solomon, Aye Phyu Phyu Aung, Uttam Kumar, Senthilnath Jayavelu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.17806
Pdf URL: https://arxiv.org/pdf/2409.17806
Copy Paste: [[2409.17806]] Continual learning with task specialist(https://arxiv.org/abs/2409.17806)
Keywords: diffusion
Abstract: Continual learning (CL) adapts deep learning models to scenarios with timely updated datasets. However, existing CL models suffer from the catastrophic forgetting issue, where new knowledge replaces past learning. In this paper, we propose Continual Learning with Task Specialists (CLTS) to address the issues of catastrophic forgetting and limited labelled data in real-world datasets by performing class incremental learning of the incoming stream of data. The model consists of Task Specialists (TS) and a Task Predictor (TP) with a pre-trained Stable Diffusion (SD) module. Here, we introduce a new specialist to handle each new task sequence, and each TS has three blocks: i) a variational autoencoder (VAE) to learn the task distribution in a low-dimensional latent space, ii) a K-Means block to perform data clustering, and iii) a Bootstrapping Language-Image Pre-training (BLIP) model to generate a small batch of captions from the input data. These captions are fed as input to the pre-trained Stable Diffusion (SD) model for the generation of task samples. The proposed model does not store any task samples for replay; instead, it uses generated samples from SD to train the TP module. A comparison study with four SOTA models conducted on three real-world datasets shows that the proposed model outperforms all the selected baselines.

Title: Ordinary Differential Equations for Enhanced 12-Lead ECG Generation

Authors: Yakir Yehuda, Kira Radinsky
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.17833
Pdf URL: https://arxiv.org/pdf/2409.17833
Copy Paste: [[2409.17833]] Ordinary Differential Equations for Enhanced 12-Lead ECG Generation(https://arxiv.org/abs/2409.17833)
Keywords: generative
Abstract: In the realm of artificial intelligence, the generation of realistic training data for supervised learning tasks presents a significant challenge. This is particularly true in the synthesis of electrocardiograms (ECGs), where the objective is to develop a synthetic 12-lead ECG model. The primary complexity of this task stems from accurately modeling the intricate biological and physiological interactions among different ECG leads. Although mathematical process simulators have shed light on these dynamics, effectively incorporating this understanding into generative models is not straightforward. In this work, we introduce an innovative method that employs ordinary differential equations (ODEs) to enhance the fidelity of generating 12-lead ECG data. This approach integrates a system of ODEs that represent cardiac dynamics directly into the generative model's optimization process, allowing for the production of biologically plausible ECG training data that authentically reflects real-world variability and inter-lead dependencies. We conducted an empirical analysis of thousands of ECGs and found that incorporating cardiac simulation insights into the data generation process significantly improves the accuracy of heart abnormality classifiers trained on this synthetic 12-lead ECG data.

Title: Machine Learning-based vs Deep Learning-based Anomaly Detection in Multivariate Time Series for Spacecraft Attitude Sensors

Authors: R. Gallon, F. Schiemenz, A. Krstova, A. Menicucci, E. Gill
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.17841
Pdf URL: https://arxiv.org/pdf/2409.17841
Copy Paste: [[2409.17841]] Machine Learning-based vs Deep Learning-based Anomaly Detection in Multivariate Time Series for Spacecraft Attitude Sensors(https://arxiv.org/abs/2409.17841)
Keywords: anomaly
Abstract: In the framework of Failure Detection, Isolation and Recovery (FDIR) on spacecraft, new AI-based approaches are emerging in the state of the art to overcome the limitations commonly imposed by traditional threshold checking. The present research aims at characterizing two different approaches to the problem of stuck values detection in multivariate time series coming from spacecraft attitude sensors. The analysis reveals the performance differences in the two approaches, while commenting on their interpretability and generalization to different scenarios.

Title: Self-supervised Monocular Depth Estimation with Large Kernel Attention

Authors: Xuezhi Xiang, Yao Wang, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17895
Pdf URL: https://arxiv.org/pdf/2409.17895
Copy Paste: [[2409.17895]] Self-supervised Monocular Depth Estimation with Large Kernel Attention(https://arxiv.org/abs/2409.17895)
Keywords: self-supervised
Abstract: Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies to estimate depth accurately. However, the Transformer treats 2D image features as 1D sequences; positional encoding only partly mitigates the loss of spatial information between different feature blocks, and the model tends to overlook channel features, which limits the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimensional structure of features while maintaining feature channel adaptivity. In addition, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.

Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Authors: Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.17912
Pdf URL: https://arxiv.org/pdf/2409.17912
Copy Paste: [[2409.17912]] Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect(https://arxiv.org/abs/2409.17912)
Keywords: generative
Abstract: We introduce Atlas-Chat, the first-ever collection of large language models specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource language variants, which are often neglected in favor of data-rich languages by contemporary LLMs.

Title: WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians

Authors: Dmytro Kotovenko, Olga Grebenkova, Nikolaos Sarafianos, Avinash Paliwal, Pingchuan Ma, Omid Poursaeed, Sreyas Mohan, Yuchen Fan, Yilei Li, Rakesh Ranjan, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17917
Pdf URL: https://arxiv.org/pdf/2409.17917
Copy Paste: [[2409.17917]] WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians(https://arxiv.org/abs/2409.17917)
Keywords: generative
Abstract: While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover's Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques. See our project page for additional results and source code: this https URL.
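Entropy-regularized optimal transport, which the abstract above uses to match two sets of Gaussians, is commonly solved with Sinkhorn iterations. The sketch below applies that generic algorithm to two small point sets with a squared Euclidean cost and uniform weights. It is not the authors' pipeline: the function name, `epsilon`, the iteration count, and the use of bare points in place of full Gaussian parameters are all assumptions made for illustration.

```python
import math


def sinkhorn_plan(xs, ys, epsilon=0.5, n_iters=200):
    """Entropy-regularized transport plan between two uniformly weighted point sets.

    Returns (plan, cost) where plan[i][j] is the mass moved from xs[i] to ys[j]
    and cost[i][j] is the squared Euclidean distance between them.
    """
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # source weights
    b = [1.0 / m] * m  # target weights
    cost = [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]
    # Gibbs kernel; very large cost / epsilon ratios can underflow in this naive form.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, cost


if __name__ == "__main__":
    content = [(0.0, 0.0), (1.0, 0.0)]
    style = [(0.0, 0.1), (1.0, 0.1)]
    plan, cost = sinkhorn_plan(content, style)
    transport_cost = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(2))
    print(plan)
    print(transport_cost)
```

A larger `epsilon` spreads mass more evenly across targets (a smoother plan), while a smaller one approaches the unregularized Earth Mover's Distance at the price of slower convergence and numerical fragility.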

Title: Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

Authors: Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17920
Pdf URL: https://arxiv.org/pdf/2409.17920
Copy Paste: [[2409.17920]] Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation(https://arxiv.org/abs/2409.17920)
Keywords: diffusion
Abstract: Personalized text-to-image generation methods can generate customized images based on reference images and have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images requiring no test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in the diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-sourced SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate the image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed on single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to the state of the art on the Concept101 and DreamBooth datasets for multi-object personalized image generation, and remarkably improves the performance on single-object personalized image generation. Our code is available at this https URL.

Title: Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion

Authors: Hengrui Gu, Kaixiong Zhou, Yili Wang, Ruobing Wang, Xin Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.17928
Pdf URL: https://arxiv.org/pdf/2409.17928
Copy Paste: [[2409.17928]] Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion(https://arxiv.org/abs/2409.17928)
Keywords: diffusion, in-context
Abstract: During pre-training, the Text-to-Image (T2I) diffusion models encode factual knowledge into their parameters. These parameterized facts enable realistic image generation, but they may become obsolete over time, thereby misrepresenting the current state of the world. Knowledge editing techniques aim to update model knowledge in a targeted way. However, facing the dual challenges posed by inadequate editing datasets and unreliable evaluation criteria, the development of T2I knowledge editing encounters difficulties in effectively generalizing injected knowledge. In this work, we design a T2I knowledge editing framework spanning three phases: First, we curate a dataset, CAKE, comprising paraphrase and multi-object tests, to enable more fine-grained assessment of knowledge generalization. Second, we propose a novel criterion, adaptive CLIP threshold, to effectively filter out false successful images under the current criterion and achieve reliable editing evaluation. Finally, we introduce MPE, a simple but effective approach for T2I knowledge editing. Instead of tuning parameters, MPE precisely recognizes and edits the outdated part of the conditioning text-prompt to accommodate the up-to-date knowledge. A straightforward implementation of MPE (based on in-context learning) exhibits better overall performance than previous model editors. We hope these efforts can further promote faithful evaluation of T2I knowledge editing methods.

Title: Perturb, Attend, Detect and Localize (PADL): Robust Proactive Image Defense

Authors: Filippo Bartolucci, Iacopo Masi, Giuseppe Lisanti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17941
Pdf URL: https://arxiv.org/pdf/2409.17941
Copy Paste: [[2409.17941]] Perturb, Attend, Detect and Localize (PADL): Robust Proactive Image Defense(https://arxiv.org/abs/2409.17941)
Keywords: diffusion, generative
Abstract: Image manipulation detection and localization have received considerable attention from the research community given the blooming of Generative Models (GMs). Detection methods that follow a passive approach may overfit to specific GMs, limiting their application in real-world scenarios, due to the growing diversity of generative models. Recently, approaches based on a proactive framework have shown the possibility of dealing with this limitation. However, these methods suffer from two main limitations, which raise concerns about potential vulnerabilities: i) the manipulation detector is not robust to noise and hence can be easily fooled; ii) the fact that they rely on fixed perturbations for image protection offers a predictable exploit for malicious attackers, enabling them to reverse-engineer and evade detection. To overcome these issues we propose PADL, a new solution able to generate image-specific perturbations using a symmetric scheme of encoding and decoding based on cross-attention, which drastically reduces the possibility of reverse engineering, even when evaluated with an adaptive attack [31]. Additionally, PADL is able to pinpoint manipulated areas, facilitating the identification of specific regions that have undergone alterations, and has more generalization power than prior art on held-out generative models. Indeed, despite being trained only on an attribute manipulation GAN model [15], our method generalizes to a range of unseen models with diverse architectural designs, such as StarGANv2, BlendGAN, DiffAE, StableDiffusion and StableDiffusionXL. Additionally, we introduce a novel evaluation protocol, which offers a fair evaluation of localisation performance as a function of detection accuracy and better captures real-world scenarios.

Title: Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition

Authors: Xinpeng Yin, Wenming Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17951
Pdf URL: https://arxiv.org/pdf/2409.17951
Copy Paste: [[2409.17951]] Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition(https://arxiv.org/abs/2409.17951)
Keywords: self-supervised
Abstract: In self-supervised skeleton-based action recognition, the mask reconstruction paradigm is gaining interest in enhancing model refinement and robustness through effective masking. However, previous works primarily relied on a single masking criterion, resulting in the model overfitting specific features and overlooking other effective information. In this paper, we introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives. Specifically, in spatial graphs, we utilize hyperbolic space to maintain joint distinctions and effectively preserve the hierarchical structure of high-dimensional skeletons, employing joint hierarchy as the masking criterion. In temporal flows, we substitute traditional distance metrics with the global attention of joints for masking, addressing the convergence of distances in high-dimensional space and the lack of a global perspective. Additionally, we incorporate cross-contrast loss based on the cross-masking framework into the loss function to enhance the model's learning of instance-level features. HA-CM shows efficiency and universality on three public large-scale datasets, NTU-60, NTU-120, and PKU-MMD. The source code of our HA-CM is available at this https URL.

Title: CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors

Authors: Linye Lyu, Jiawei Zhou, Daojing He, Yu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17963
Pdf URL: https://arxiv.org/pdf/2409.17963
Copy Paste: [[2409.17963]] CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors(https://arxiv.org/abs/2409.17963)
Keywords: diffusion
Abstract: Prior works on physical adversarial camouflage against vehicle detectors mainly focus on the effectiveness and robustness of the attack. The current most successful methods optimize 3D vehicle texture at a pixel level. However, this results in conspicuous and attention-grabbing patterns in the generated camouflage, which humans can easily identify. To address this issue, we propose a Customizable and Natural Camouflage Attack (CNCA) method by leveraging an off-the-shelf pre-trained diffusion model. By sampling the optimal texture image from the diffusion model with a user-specific text prompt, our method can generate natural and customizable adversarial camouflage while maintaining high attack performance. With extensive experiments on the digital and physical worlds and user studies, the results demonstrate that our proposed method can generate significantly more natural-looking camouflage than the state-of-the-art baselines while achieving competitive attack performance. Our code is available at this https URL.

Title: LLM4Brain: Training a Large Language Model for Brain Video Understanding

Authors: Ruizhe Zheng, Lichao Sun
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2409.17987
Pdf URL: https://arxiv.org/pdf/2409.17987
Copy Paste: [[2409.17987]] LLM4Brain: Training a Large Language Model for Brain Video Understanding(https://arxiv.org/abs/2409.17987)
Keywords: self-supervised
Abstract: Decoding visual-semantic information from brain signals, such as functional MRI (fMRI), across different subjects poses significant challenges, including low signal-to-noise ratio, limited data availability, and cross-subject variability. Recent advancements in large language models (LLMs) show remarkable effectiveness in processing multimodal information. In this study, we introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. Specifically, we employ fine-tuning techniques on an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli. Subsequently, these representations are mapped to the textual modality by an LLM. In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses. Our proposed method achieves good results using various quantitative semantic metrics, while yielding similarity with ground-truth information.

Title: InterNet: Unsupervised Cross-modal Homography Estimation Based on Interleaved Modality Transfer and Self-supervised Homography Prediction

Authors: Junchen Yu, Si-Yuan Cao, Runmin Zhang, Chenghao Zhang, Jianxin Hu, Zhu Yu, Hui-liang Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.17993
Pdf URL: https://arxiv.org/pdf/2409.17993
Copy Paste: [[2409.17993]] InterNet: Unsupervised Cross-modal Homography Estimation Based on Interleaved Modality Transfer and Self-supervised Homography Prediction(https://arxiv.org/abs/2409.17993)
Keywords: self-supervised
Abstract: We propose a novel unsupervised cross-modal homography estimation framework, based on interleaved modality transfer and self-supervised homography prediction, named InterNet. InterNet integrates modality transfer and self-supervised homography estimation, introducing an innovative interleaved optimization framework to alternately promote both components. The modality transfer gradually narrows the modality gaps, facilitating the self-supervised homography estimation to fully leverage the synthetic intra-modal data. The self-supervised homography estimation progressively achieves reliable predictions, thereby providing robust cross-modal supervision for the modality transfer. To further boost the estimation accuracy, we also formulate a fine-grained homography feature loss to improve the connection between two components. Furthermore, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. Experiments reveal that InterNet achieves the state-of-the-art (SOTA) performance among unsupervised methods, and even outperforms many supervised methods such as MHN and LocalTrans.

Title: Transferring disentangled representations: bridging the gap between synthetic and real images

Authors: Jacopo Dapueto, Nicoletta Noceti, Francesca Odone
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.18017
Pdf URL: https://arxiv.org/pdf/2409.18017
Copy Paste: [[2409.18017]] Transferring disentangled representations: bridging the gap between synthetic and real images(https://arxiv.org/abs/2409.18017)
Keywords: generative
Abstract: Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, their resolution and limited access to ground truth labels. Specifically on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric, to measure the quality of factors encoding in the representation. Our results indicate that some level of disentanglement, transferring a representation from synthetic to real data, is possible and effective.

Title: EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Authors: Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2409.18042
Pdf URL: https://arxiv.org/pdf/2409.18042
Copy Paste: [[2409.18042]] EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions(https://arxiv.org/abs/2409.18042)
Keywords: foundation model
Abstract: GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or even no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to equip Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, while also supporting omni-modal spoken dialogue with vivid emotions.

Title: Stable Video Portraits

Authors: Mirela Ostrek, Justus Thies
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.18083
Pdf URL: https://arxiv.org/pdf/2409.18083
Copy Paste: [[2409.18083]] Stable Video Portraits(https://arxiv.org/abs/2409.18083)
Keywords: diffusion, generative
Abstract: Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.

Title: DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models

Authors: Helin Cao, Sven Behnke
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2409.18092
Pdf URL: https://arxiv.org/pdf/2409.18092
Copy Paste: [[2409.18092]] DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models(https://arxiv.org/abs/2409.18092)
Keywords: diffusion
Abstract: Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle's surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets and our approach outperforms the state-of-the-art for SSC.

Title: Self-supervised Pretraining for Cardiovascular Magnetic Resonance Cine Segmentation

Authors: Rob A. J. de Mooij, Josien P. W. Pluim, Cian M. Scannell
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.18100
Pdf URL: https://arxiv.org/pdf/2409.18100
Copy Paste: [[2409.18100]] Self-supervised Pretraining for Cardiovascular Magnetic Resonance Cine Segmentation(https://arxiv.org/abs/2409.18100)
Keywords: self-supervised
Abstract: Self-supervised pretraining (SSP) has shown promising results in learning from large unlabeled datasets and, thus, could be useful for automated cardiovascular magnetic resonance (CMR) short-axis cine segmentation. However, inconsistent reports of the benefits of SSP for segmentation have made it difficult to apply SSP to CMR. Therefore, this study aimed to evaluate SSP methods for CMR cine segmentation. To this end, short-axis cine stacks of 296 subjects (90618 2D slices) were used for unlabeled pretraining with four SSP methods: SimCLR, positional contrastive learning, DINO, and masked image modeling (MIM). Subsets of varying numbers of subjects were used for supervised fine-tuning of 2D models for each SSP method, as well as to train a 2D baseline model from scratch. The fine-tuned models were compared to the baseline using the 3D Dice similarity coefficient (DSC) in a test dataset of 140 subjects. The SSP methods showed no performance gains with the largest supervised fine-tuning subset compared to the baseline (DSC = 0.89). When only 10 subjects (231 2D slices) were available for supervised training, SSP using MIM (DSC = 0.86) improved over training from scratch (DSC = 0.82). This study found that SSP is valuable for CMR cine segmentation when labeled training data is scarce, but does not aid state-of-the-art deep learning methods when ample labeled data is available. Moreover, the choice of SSP method is important. The code is publicly available at: this https URL
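
For readers unfamiliar with the Dice similarity coefficient used as the evaluation metric here, the following is a minimal sketch of how a 3D DSC can be computed from two binary masks. The function name `dice_coefficient`, the toy volumes, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the study's code.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity coefficient 2|A ∩ B| / (|A| + |B|) for binary masks of any shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

# Toy 3D volumes (slices, height, width); the two boxes overlap on 24 of their 32 voxels each.
pred = np.zeros((4, 8, 8), dtype=bool)
target = np.zeros((4, 8, 8), dtype=bool)
pred[1:3, 2:6, 2:6] = True
target[1:3, 3:7, 2:6] = True

dsc = dice_coefficient(pred, target)  # 2 * 24 / (32 + 32) = 0.75
```

Computing the score over the whole stacked volume, rather than averaging per-slice scores, is what makes it a 3D DSC even when the model itself predicts 2D slices.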

Title: EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Authors: Jiaxiang Tang, Zhaoshuo Li, Zekun Hao, Xian Liu, Gang Zeng, Ming-Yu Liu, Qinsheng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.18114
Pdf URL: https://arxiv.org/pdf/2409.18114
Copy Paste: [[2409.18114]] EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation(https://arxiv.org/abs/2409.18114)
Keywords: diffusion
Abstract: Current auto-regressive mesh generation methods suffer from issues such as incompleteness, insufficient detail, and poor generalization. In this paper, we propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of $512^3$. We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly enhancing training efficiency. Furthermore, our model compresses variable-length triangular meshes into a fixed-length latent space, enabling training latent diffusion models for better generalization. Extensive experiments demonstrate the superior quality, diversity, and generalization capabilities of our model in both point cloud and image-conditioned mesh generation tasks.

Title: Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Authors: Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, Ying-Cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.18124
Pdf URL: https://arxiv.org/pdf/2409.18124
Copy Paste: [[2409.18124]] Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction(https://arxiv.org/abs/2409.18124)
Keywords: diffusion, foundation model
Abstract: Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising solution to enhance zero-shot generalization in dense prediction tasks. However, existing methods often uncritically use the original diffusion formulation, which may not be optimal due to the fundamental differences between dense prediction and image generation. In this paper, we provide a systematic analysis of the diffusion formulation for dense prediction, focusing on both quality and efficiency. We find that the original parameterization type for image generation, which learns to predict noise, is harmful for dense prediction; the multi-step noising/denoising diffusion process is also unnecessary and challenging to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process into a single-step procedure, simplifying optimization and significantly boosting inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves SoTA performance in zero-shot depth and normal estimation across various datasets. It also significantly enhances efficiency, being hundreds of times faster than most existing diffusion-based methods.

Title: FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner

Authors: Wenliang Zhao, Minglei Shi, Xumin Yu, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.18128
Pdf URL: https://arxiv.org/pdf/2409.18128
Copy Paste: [[2409.18128]] FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner(https://arxiv.org/abs/2409.18128)
Keywords: diffusion, generative
Abstract: Building on the success of diffusion models in visual generation, flow-based models reemerge as another prominent family of generative models that have achieved competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow-matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in the flow-based models become stable during sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied to various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1% to 58.3% on class-conditional generation and 29.8% to 38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet with 100 (ms / img) and an FID of 3.93 with 38 (ms / img), achieving real-time image generation and establishing the new state-of-the-art. Code is available at this https URL.
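
To make the remark about straighter sampling trajectories concrete, here is a small sketch of Euler sampling along a velocity field. It does not reproduce FlowTurbo or its velocity refiner: `euler_sample` and `toy_velocity` are hypothetical names, and the velocity field is a hand-written toy whose trajectories are exactly straight lines toward a fixed target, standing in for a learned network.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t stays strictly below 1, so the toy field never divides by zero
        x = x + dt * velocity(x, t)
    return x

target = np.array([2.0, -1.0])  # assumed endpoint of the toy flow

def toy_velocity(x, t):
    # Toy field (an assumption, not a trained model): points straight at the target,
    # scaled so that the path from any start reaches the target exactly at t = 1.
    return (target - x) / (1.0 - t)

start = np.array([-3.0, 4.0])
one_step = euler_sample(toy_velocity, start, n_steps=1)
ten_steps = euler_sample(toy_velocity, start, n_steps=10)
```

Because this toy trajectory is perfectly straight, a single Euler step lands on the same endpoint as ten steps. Learned velocity fields are only approximately straight, which is why the number of steps still matters in practice.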