2024-08-19

Title: Segment Anything for Videos: A Systematic Survey

Authors: Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.08315
Pdf URL: https://arxiv.org/pdf/2408.08315
Copy Paste: [[2408.08315]] Segment Anything for Videos: A Systematic Survey(https://arxiv.org/abs/2408.08315)
Keywords: foundation model
Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (\eg, text-to-mask) tasks, but also in the video domain. Additionally, the latest released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks, a comprehensive and in-depth review in the video domain is notably absent. To address this gap, this work conducts a systematic review on SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks by discussing its recent advances, and innovation opportunities of developing foundation models on broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, as well as insightful analysis are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond.

Title: TurboEdit: Instant text-based image editing

Authors: Zongze Wu, Nicholas Kolkin, Jonathan Brandt, Richard Zhang, Eli Shechtman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.08332
Pdf URL: https://arxiv.org/pdf/2408.08332
Copy Paste: [[2408.08332]] TurboEdit: Instant text-based image editing(https://arxiv.org/abs/2408.08332)
Keywords: diffusion
Abstract: We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. It can further control the editing strength and accept instructive text prompt. Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 number of functional evaluations (NFEs) in inversion (one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

Title: METR: Image Watermarking with Large Number of Unique Messages

Authors: Alexander Varlamov, Daria Diatlova, Egor Spirin
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2408.08340
Pdf URL: https://arxiv.org/pdf/2408.08340
Copy Paste: [[2408.08340]] METR: Image Watermarking with Large Number of Unique Messages(https://arxiv.org/abs/2408.08340)
Keywords: diffusion, generative
Abstract: Improvements in diffusion models have boosted the quality of image generation, which has led researchers, companies, and creators to focus on improving watermarking algorithms. This provision would make it possible to clearly identify the creators of generative art. The main challenges that modern watermarking algorithms face have to do with their ability to withstand attacks and encrypt many unique messages, such as user IDs. In this paper, we present METR: Message Enhanced Tree-Ring, which is an approach that aims to address these challenges. METR is built on the Tree-Ring watermarking algorithm, a technique that makes it possible to encode multiple distinct messages without compromising attack resilience or image quality. This ensures the suitability of this watermarking algorithm for any Diffusion Model. In order to surpass the limitations on the quantity of encoded messages, we propose METR++, an enhanced version of METR. This approach, while limited to the Latent Diffusion Model architecture, is designed to inject a virtually unlimited number of unique messages. We demonstrate its robustness to attacks and ability to encrypt many unique messages while preserving image quality, which makes METR and METR++ hold great potential for practical applications in real-world settings. Our code is available at this https URL

Title: Penny-Wise and Pound-Foolish in Deepfake Detection

Authors: Yabin Wang, Zhiwu Huang, Su Zhou, Adam Prugel-Bennett, Xiaopeng Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08412
Pdf URL: https://arxiv.org/pdf/2408.08412
Copy Paste: [[2408.08412]] Penny-Wise and Pound-Foolish in Deepfake Detection(https://arxiv.org/abs/2408.08412)
Keywords: diffusion
Abstract: The diffusion of deepfake technologies has sparked serious concerns about its potential misuse across various domains, prompting the urgent need for robust detection methods. Despite advancement, many current approaches prioritize short-term gains at expense of long-term effectiveness. This paper critiques the overly specialized approach of fine-tuning pre-trained models solely with a penny-wise objective on a single deepfake dataset, while disregarding the pound-wise balance for generalization and knowledge retention. To address this "Penny-Wise and Pound-Foolish" issue, we propose a novel learning framework (PoundNet) for generalization of deepfake detection on a pre-trained vision-language model. PoundNet incorporates a learnable prompt design and a balanced objective to preserve broad knowledge from upstream tasks (object classification) while enhancing generalization for downstream tasks (deepfake detection). We train PoundNet on a standard single deepfake dataset, following common practice in the literature. We then evaluate its performance across 10 public large-scale deepfake datasets with 5 main evaluation metrics-forming the largest benchmark test set for assessing the generalization ability of deepfake detection models, to our knowledge. The comprehensive benchmark evaluation demonstrates the proposed PoundNet is significantly less "Penny-Wise and Pound-Foolish", achieving a remarkable improvement of 19% in deepfake detection performance compared to state-of-the-art methods, while maintaining a strong performance of 63% on object classification tasks, where other deepfake detection models tend to be ineffective. Code and data are open-sourced at this https URL.

Title: SpectralEarth: Training Hyperspectral Foundation Models at Scale

Authors: Nassim Ait Ali Braham, Conrad M Albrecht, Julien Mairal, Jocelyn Chanussot, Yi Wang, Xiao Xiang Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.08447
Pdf URL: https://arxiv.org/pdf/2408.08447
Copy Paste: [[2408.08447]] SpectralEarth: Training Hyperspectral Foundation Models at Scale(https://arxiv.org/abs/2408.08447)
Keywords: self-supervised, foundation model
Abstract: Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, we introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models leveraging data from the Environmental Mapping and Analysis Program (EnMAP). SpectralEarth comprises 538,974 image patches covering 415,153 unique locations from more than 11,636 globally distributed EnMAP scenes spanning two years of archive. Additionally, 17.5% of these locations include multiple timestamps, enabling multi-temporal HSI analysis. Utilizing state-of-the-art self-supervised learning (SSL) algorithms, we pretrain a series of foundation models on SpectralEarth. We integrate a spectral adapter into classical vision backbones to accommodate the unique characteristics of HSI. In tandem, we construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation. Experimental results support the versatility of our models, showcasing their generalizability across different tasks and sensors. We also highlight computational efficiency during model fine-tuning. The dataset, models, and source code will be made publicly available.

Title: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

Authors: Mohammadreza Samadi, Fred X. Han, Mohammad Salameh, Hao Wu, Fengyu Sun, Chunhua Zhou, Di Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08495
Pdf URL: https://arxiv.org/pdf/2408.08495
Copy Paste: [[2408.08495]] Achieving Complex Image Edits via Function Aggregation with Diffusion Models(https://arxiv.org/abs/2408.08495)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated strong performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet two key challenges persist. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on sequential processing. Second, relying on textual prompts to determine the editing region can lead to unintended alterations in other parts of the image. In this work, we introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. FunEditor is 5 to 24 times faster inference than existing methods on complex tasks like object movement. Our experiments demonstrate that FunEditor significantly outperforms recent baselines, including both inference-time optimization methods and fine-tuned models, across various metrics, such as image quality assessment (IQA) and object-background consistency.

Title: Efficient Image-to-Image Diffusion Classifier for Adversarial Robustness

Authors: Hefei Mei, Minjing Dong, Chang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08502
Pdf URL: https://arxiv.org/pdf/2408.08502
Copy Paste: [[2408.08502]] Efficient Image-to-Image Diffusion Classifier for Adversarial Robustness(https://arxiv.org/abs/2408.08502)
Keywords: diffusion
Abstract: Diffusion models (DMs) have demonstrated great potential in the field of adversarial robustness, where DM-based defense methods can achieve superior defense capability without adversarial training. However, they all require huge computational costs due to the usage of large-scale pre-trained DMs, making it difficult to conduct full evaluation under strong attacks and compare with traditional CNN-based methods. Simply reducing the network size and timesteps in DMs could significantly harm the image generation quality, which invalidates previous frameworks. To alleviate this issue, we redesign the diffusion framework from generating high-quality images to predicting distinguishable image labels. Specifically, we employ an image translation framework to learn many-to-one mapping from input samples to designed orthogonal image labels. Based on this framework, we introduce an efficient Image-to-Image diffusion classifier with a pruned U-Net structure and reduced diffusion timesteps. Besides the framework, we redesign the optimization objective of DMs to fit the target of image classification, where a new classification loss is incorporated in the DM-based image translation framework to distinguish the generated label from those of other classes. We conduct sufficient evaluations of the proposed classifier under various attacks on popular benchmarks. Extensive experiments show that our method achieves better adversarial robustness with fewer computational costs than DM-based and CNN-based methods. The code is available at this https URL.

Title: Visual-Friendly Concept Protection via Selective Adversarial Perturbations

Authors: Xiaoyue Mi, Fan Tang, Juan Cao, Peng Li, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08518
Pdf URL: https://arxiv.org/pdf/2408.08518
Copy Paste: [[2408.08518]] Visual-Friendly Concept Protection via Selective Adversarial Perturbations(https://arxiv.org/abs/2408.08518)
Keywords: diffusion
Abstract: Personalized concept generation by tuning diffusion models with a few images raises potential legal and ethical concerns regarding privacy and intellectual property rights. Researchers attempt to prevent malicious personalization using adversarial perturbations. However, previous efforts have mainly focused on the effectiveness of protection while neglecting the visibility of perturbations. They utilize global adversarial perturbations, which introduce noticeable alterations to original images and significantly degrade visual quality. In this work, we propose the Visual-Friendly Concept Protection (VCPro) framework, which prioritizes the protection of key concepts chosen by the image owner through adversarial perturbations with lower perceptibility. To ensure these perturbations are as inconspicuous as possible, we introduce a relaxed optimization objective to identify the least perceptible yet effective adversarial perturbations, solved using the Lagrangian multiplier method. Qualitative and quantitative experiments validate that VCPro achieves a better trade-off between the visibility of perturbations and protection effectiveness, effectively prioritizing the protection of target concepts in images with less perceptible perturbations.

Title: GS-ID: Illumination Decomposition on Gaussian Splatting via Diffusion Prior and Parametric Light Source Optimization

Authors: Kang Du, Zhihao Liang, Zeyu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.08524
Pdf URL: https://arxiv.org/pdf/2408.08524
Copy Paste: [[2408.08524]] GS-ID: Illumination Decomposition on Gaussian Splatting via Diffusion Prior and Parametric Light Source Optimization(https://arxiv.org/abs/2408.08524)
Keywords: diffusion
Abstract: We present GS-ID, a novel framework for illumination decomposition on Gaussian Splatting, achieving photorealistic novel view synthesis and intuitive light editing. Illumination decomposition is an ill-posed problem facing three main challenges: 1) priors for geometry and material are often lacking; 2) complex illumination conditions involve multiple unknown light sources; and 3) calculating surface shading with numerous light sources is computationally expensive. To address these challenges, we first introduce intrinsic diffusion priors to estimate the attributes for physically based rendering. Then we divide the illumination into environmental and direct components for joint optimization. Last, we employ deferred rendering to reduce the computational load. Our framework uses a learnable environment map and Spherical Gaussians (SGs) to represent light sources parametrically, therefore enabling controllable and photorealistic relighting on Gaussian Splatting. Extensive experiments and applications demonstrate that GS-ID produces state-of-the-art illumination decomposition results while achieving better geometry reconstruction and rendering performance.

Title: Inverse design with conditional cascaded diffusion models

Authors: Milad Habibi, Mark Fuge
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.08526
Pdf URL: https://arxiv.org/pdf/2408.08526
Copy Paste: [[2408.08526]] Inverse design with conditional cascaded diffusion models(https://arxiv.org/abs/2408.08526)
Keywords: diffusion, generative
Abstract: Adjoint-based design optimizations are usually computationally expensive and those costs scale with resolution. To address this, researchers have proposed machine learning approaches for inverse design that can predict higher-resolution solutions from lower cost/resolution ones. Due to the recent success of diffusion models over traditional generative models, we extend the use of diffusion models for multi-resolution tasks by proposing the conditional cascaded diffusion model (cCDM). Compared to GANs, cCDM is more stable to train, and each diffusion model within the cCDM can be trained independently, thus each model's parameters can be tuned separately to maximize the performance of the pipeline. Our study compares cCDM against a cGAN model with transfer learning. Our results demonstrate that the cCDM excels in capturing finer details, preserving volume fraction constraints, and minimizing compliance errors in multi-resolution tasks when a sufficient amount of high-resolution training data (more than 102 designs) is available. Furthermore, we explore the impact of training data size on the performance of both models. While both models show decreased performance with reduced high-resolution training data, the cCDM loses its superiority to the cGAN model with transfer learning when training data is limited (less than 102), and we show the break-even point for this transition. Also, we highlight that while the diffusion model may achieve better pixel-wise performance in both low-resolution and high-resolution scenarios, this does not necessarily guarantee that the model produces optimal compliance error or constraint satisfaction.

Title: Integrating Multi-view Analysis: Multi-view Mixture-of-Expert for Textual Personality Detection

Authors: Haohao Zhu, Xiaokun Zhang, Junyu Lu, Liang Yang, Hongfei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.08551
Pdf URL: https://arxiv.org/pdf/2408.08551
Copy Paste: [[2408.08551]] Integrating Multi-view Analysis: Multi-view Mixture-of-Expert for Textual Personality Detection(https://arxiv.org/abs/2408.08551)
Keywords: self-supervised
Abstract: Textual personality detection aims to identify personality traits by analyzing user-generated content. To achieve this effectively, it is essential to thoroughly examine user-generated content from various perspectives. However, previous studies have struggled with automatically extracting and effectively integrating information from multiple perspectives, thereby limiting their performance on personality detection. To address these challenges, we propose the Multi-view Mixture-of-Experts Model for Textual Personality Detection (MvP). MvP introduces a Multi-view Mixture-of-Experts (MoE) network to automatically analyze user posts from various perspectives. Additionally, it employs User Consistency Regularization to mitigate conflicts among different perspectives and learn a multi-view generic user representation. The model's training is optimized via a multi-task joint learning strategy that balances supervised personality detection with self-supervised user consistency constraints. Experimental results on two widely-used personality detection datasets demonstrate the effectiveness of the MvP model and the benefits of automatically analyzing user posts from diverse perspectives for textual personality detection.

Title: A New Chinese Landscape Paintings Generation Model based on Stable Diffusion using DreamBooth

Authors: Yujia Gu, Xinyu Fang, Xueyuan Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08561
Pdf URL: https://arxiv.org/pdf/2408.08561
Copy Paste: [[2408.08561]] A New Chinese Landscape Paintings Generation Model based on Stable Diffusion using DreamBooth(https://arxiv.org/abs/2408.08561)
Keywords: diffusion
Abstract: This study mainly introduces a method combining the Stable Diffusion Model (SDM) and Parameter-Efficient Fine-Tuning method for generating Chinese Landscape Paintings. This training process is accelerated by combining LoRA with pre-trained SDM and DreamBooth with pre-trained SDM, respectively. On the Chinese Landscape Paintings Internet dataset used in this paper, this study finds that SDM combined with DreamBooth exhibits superior performance, outperforming other models, including the generic pre-trained SDM and LoRA-based fine-tuning SDM. The SDM combined with DreamBooth achieves a FID of 12.75 on the dataset and outperforms all other models in terms of expert evaluation, highlighting the model's versatility in the field of Chinese Landscape Paintings given the unique identifier, high fidelity and high quality. This study illustrates the potential of specialised fine-tuning method to improve the performance of SDM on domain-specific tasks, particularly in the domain of Landscape Paintings.

Title: RadioDiff: An Effective Generative Diffusion Model for Sampling-Free Dynamic Radio Map Construction

Authors: Xiucheng Wang, Keda Tao, Nan Cheng, Zhisheng Yin, Zan Li, Yuan Zhang, Xuemin Shen
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2408.08593
Pdf URL: https://arxiv.org/pdf/2408.08593
Copy Paste: [[2408.08593]] RadioDiff: An Effective Generative Diffusion Model for Sampling-Free Dynamic Radio Map Construction(https://arxiv.org/abs/2408.08593)
Keywords: diffusion, generative
Abstract: Radio map (RM) is a promising technology that can obtain pathloss based on only location, which is significant for 6G network applications to reduce the communication costs for pathloss estimation. However, the construction of RM in traditional is either computationally intensive or depends on costly sampling-based pathloss measurements. Although the neural network (NN)-based method can efficiently construct the RM without sampling, its performance is still suboptimal. This is primarily due to the misalignment between the generative characteristics of the RM construction problem and the discrimination modeling exploited by existing NN-based methods. Thus, to enhance RM construction performance, in this paper, the sampling-free RM construction is modeled as a conditional generative problem, where a denoised diffusion-based method, named RadioDiff, is proposed to achieve high-quality RM construction. In addition, to enhance the diffusion model's capability of extracting features from dynamic environments, an attention U-Net with an adaptive fast Fourier transform module is employed as the backbone network to improve the dynamic environmental features extracting capability. Meanwhile, the decoupled diffusion model is utilized to further enhance the construction performance of RMs. Moreover, a comprehensive theoretical analysis of why the RM construction is a generative problem is provided for the first time, from both perspectives of data features and NN training methods. Experimental results show that the proposed RadioDiff achieves state-of-the-art performance in all three metrics of accuracy, structural similarity, and peak signal-to-noise ratio. The code is available at this https URL.

Title: Generative Dataset Distillation Based on Diffusion Model

Authors: Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, Miki Haseyama
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.08610
Pdf URL: https://arxiv.org/pdf/2408.08610
Copy Paste: [[2408.08610]] Generative Dataset Distillation Based on Diffusion Model(https://arxiv.org/abs/2408.08610)
Keywords: diffusion, generative
Abstract: This paper presents our method for the generative track of The First Dataset Distillation Challenge at ECCV 2024. Since the diffusion model has become the mainstay of generative models because of its high-quality generative effects, we focus on distillation methods based on the diffusion model. Considering that the track can only generate a fixed number of images in 10 minutes using a generative model for CIFAR-100 and Tiny-ImageNet datasets, we need to use a generative model that can generate images at high speed. In this study, we proposed a novel generative dataset distillation method based on Stable Diffusion. Specifically, we use the SDXL-Turbo model which can generate images at high speed and quality. Compared to other diffusion models that can only generate images per class (IPC) = 1, our method can achieve an IPC = 10 for Tiny-ImageNet and an IPC = 20 for CIFAR-100, respectively. Additionally, to generate high-quality distilled datasets for CIFAR-100 and Tiny-ImageNet, we use the class information as text prompts and post data augmentation for the SDXL-Turbo model. Experimental results show the effectiveness of the proposed method, and we achieved third place in the generative track of the ECCV 2024 DD Challenge. Codes are available at this https URL.

Title: An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation

Authors: Peiming Guo, Sinuo Liu, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.08650
Pdf URL: https://arxiv.org/pdf/2408.08650
Copy Paste: [[2408.08650]] An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation(https://arxiv.org/abs/2408.08650)
Keywords: diffusion
Abstract: Photo-Sharing Multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment. Using image text caption as the bridge, a pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task. However, representing the images with text captions may loss important visual details and information and cause error propagation in the complex dialogue system. Besides, the pipeline model isolates the three models separately because discrete image text captions hinder end-to-end gradient propagation. We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model. The large language model employs the Q-Former to perceive visual images in the input end. For image generation in the output end, we propose a dynamic vocabulary transformation matrix and use straight-through and gumbel-softmax techniques to align the large language model and stable diffusion model and achieve end-to-end gradient propagation. We perform experiments on PhotoChat and DialogCC datasets to evaluate our end-to-end model. Compared with pipeline models, the end-to-end model gains state-of-the-art performances on various metrics of text and image generation. More analysis experiments also verify the effectiveness of the end-to-end model for photo-sharing multi-modal dialogue generation.

Title: Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning

Authors: Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.08670
Pdf URL: https://arxiv.org/pdf/2408.08670
Copy Paste: [[2408.08670]] Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning(https://arxiv.org/abs/2408.08670)
Keywords: foundation model
Abstract: Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, and it hinders their adoption in several edge or low-energy applications. To this end, in this paper we introduce an efficient fine-tuning method for ViTs called $\textbf{ALaST}$ ($\textit{Adaptive Layer Selection Fine-Tuning for Vision Transformers}$) to speed up the fine-tuning process while reducing computational cost, memory load, and training time. Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch. Therefore, at each fine-tuning step, we adaptively estimate the importance of all layers and we assign what we call ``compute budgets'' accordingly. Layers that were allocated lower budgets are either trained with a reduced number of input tokens or kept frozen. Freezing a layer reduces the computational cost and memory usage by preventing updates to its weights, while discarding tokens removes redundant data, speeding up processing and reducing memory requirements. We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources across layers, resulting in substantial reductions in training time (up to 1.5x), FLOPs (up to 2x), and memory load (up to 2x) compared to traditional full fine-tuning approaches. Additionally, it can be successfully combined with other parameter-efficient fine-tuning methods, such as LoRA.

Title: A Novel Buffered Federated Learning Framework for Privacy-Driven Anomaly Detection in IIoT

Authors: Samira Kamali Poorazad, Chafika Benzaid, Tarik Taleb
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.08722
Pdf URL: https://arxiv.org/pdf/2408.08722
Copy Paste: [[2408.08722]] A Novel Buffered Federated Learning Framework for Privacy-Driven Anomaly Detection in IIoT(https://arxiv.org/abs/2408.08722)
Keywords: anomaly
Abstract: Industrial Internet of Things (IIoT) is highly sensitive to data privacy and cybersecurity threats. Federated Learning (FL) has emerged as a solution for preserving privacy, enabling private data to remain on local IIoT clients while cooperatively training models to detect network anomalies. However, both synchronous and asynchronous FL architectures exhibit limitations, particularly when dealing with clients with varying speeds due to data heterogeneity and resource constraints. Synchronous architecture suffers from straggler effects, while asynchronous methods encounter communication bottlenecks. Additionally, FL models are prone to adversarial inference attacks aimed at disclosing private training data. To address these challenges, we propose a Buffered FL (BFL) framework empowered by homomorphic encryption for anomaly detection in heterogeneous IIoT environments. BFL utilizes a novel weighted average time approach to mitigate both straggler effects and communication bottlenecks, ensuring fairness between clients with varying processing speeds through collaboration with a buffer-based server. The performance results, derived from two datasets, show the superiority of BFL compared to state-of-the-art FL methods, demonstrating improved accuracy and convergence speed while enhancing privacy preservation.

Title: Comparative Analysis of Generative Models: Enhancing Image Synthesis with VAEs, GANs, and Stable Diffusion

Authors: Sanchayan Vivekananthan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.08751
Pdf URL: https://arxiv.org/pdf/2408.08751
Copy Paste: [[2408.08751]] Comparative Analysis of Generative Models: Enhancing Image Synthesis with VAEs, GANs, and Stable Diffusion(https://arxiv.org/abs/2408.08751)
Keywords: diffusion, generative
Abstract: This paper examines three major generative modelling frameworks: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Stable Diffusion models. VAEs are effective at learning latent representations but frequently yield blurry results. GANs can generate realistic images but face issues such as mode collapse. Stable Diffusion models, while producing high-quality images with strong semantic coherence, are demanding in terms of computational resources. Additionally, the paper explores how incorporating Grounding DINO and Grounded SAM with Stable Diffusion improves image accuracy by utilising sophisticated segmentation and inpainting techniques. The analysis guides on selecting suitable models for various applications and highlights areas for further research.

Title: PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders

Authors: Xiangdong Zhang, Shaofeng Zhang, Junchi Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08753
Pdf URL: https://arxiv.org/pdf/2408.08753
Copy Paste: [[2408.08753]] PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders(https://arxiv.org/abs/2408.08753)
Keywords: self-supervised
Abstract: Masked autoencoder has been widely explored in point cloud self-supervised learning, whereby the point cloud is generally divided into visible and masked parts. These methods typically include an encoder accepting visible patches (normalized) and corresponding patch centers (position) as input, with the decoder accepting the output of the encoder and the centers (position) of the masked parts to reconstruct each point in the masked patches. Then, the pre-trained encoders are used for downstream tasks. In this paper, we show a motivating empirical result that when directly feeding the centers of masked patches to the decoder without information from the encoder, it still reconstructs well. In other words, the centers of patches are important and the reconstruction objective does not necessarily rely on representations of the encoder, thus preventing the encoder from learning semantic representations. Based on this key observation, we propose a simple yet effective method, i.e., learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE) which guides the model to learn to predict the significant centers and use the predicted centers to replace the directly provided centers. Specifically, we propose a Predicting Center Module (PCM) that shares parameters with the original encoder with extra cross-attention to predict centers. Our method is of high pre-training efficiency compared to other alternatives and achieves great improvement over Point-MAE, particularly outperforming it by 5.50%, 6.03%, and 5.17% on three variants of ScanObjectNN. The code will be made publicly available.

Title: Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Authors: Chenming Tang, Zhixiang Wang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.08780
Pdf URL: https://arxiv.org/pdf/2408.08780
Copy Paste: [[2408.08780]] Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions(https://arxiv.org/abs/2408.08780)
Keywords: in-context
Abstract: With the help of in-context learning (ICL), large language models (LLMs) have achieved impressive performance across various tasks. However, the function of descriptive instructions during ICL remains under-explored. In this work, we propose an ensemble prompt framework to describe the selection criteria of multiple in-context examples, and preliminary experiments on machine translation (MT) across six translation directions confirm that this framework boosts ICL perfromance. But to our surprise, LLMs might not necessarily care what the descriptions actually say, and the performance gain is primarily caused by the ensemble format, since the framework could lead to improvement even with random descriptive nouns. We further apply this new ensemble prompt on a range of commonsense, math, logical reasoning and hallucination tasks with three LLMs and achieve promising results, suggesting again that designing a proper prompt format would be much more effective and efficient than paying effort into specific descriptions. Our code will be publicly available once this paper is published.

Title: Representation Learning of Geometric Trees

Authors: Zheng Zhang, Allen Zhang, Ruth Nelson, Giorgio Ascoli, Liang Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.08799
Pdf URL: https://arxiv.org/pdf/2408.08799
Copy Paste: [[2408.08799]] Representation Learning of Geometric Trees(https://arxiv.org/abs/2408.08799)
Keywords: self-supervised
Abstract: Geometric trees are characterized by their tree-structured layout and spatially constrained nodes and edges, which significantly impacts their topological attributes. This inherent hierarchical structure plays a crucial role in domains such as neuron morphology and river geomorphology, but traditional graph representation methods often overlook these specific characteristics of tree structures. To address this, we introduce a new representation learning framework tailored for geometric trees. It first features a unique message passing neural network, which is both provably geometrical structure-recoverable and rotation-translation invariant. To address the data label scarcity issue, our approach also includes two innovative training targets that reflect the hierarchical ordering and geometric structure of these geometric trees. This enables fully self-supervised learning without explicit labels. We validate our method's effectiveness on eight real-world datasets, demonstrating its capability to represent geometric trees.

Title: Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models

Authors: Lin Zhao, Xiao Chen, Eric Z. Chen, Yikang Liu, Terrence Chen, Shanhui Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08813
Pdf URL: https://arxiv.org/pdf/2408.08813
Copy Paste: [[2408.08813]] Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models(https://arxiv.org/abs/2408.08813)
Keywords: foundation model
Abstract: Medical image segmentation is crucial for clinical decision-making, but the scarcity of annotated data presents significant challenges. Few-shot segmentation (FSS) methods show promise but often require retraining on the target domain and struggle to generalize across different modalities. Similarly, adapting foundation models like the Segment Anything Model (SAM) for medical imaging has limitations, including the need for finetuning and domain-specific adaptation. To address these issues, we propose a novel method that adapts DINOv2 and Segment Anything Model 2 (SAM 2) for retrieval-augmented few-shot medical image segmentation. Our approach uses DINOv2's feature as query to retrieve similar samples from limited annotated data, which are then encoded as memories and stored in memory bank. With the memory attention mechanism of SAM 2, the model leverages these memories as conditions to generate accurate segmentation of the target image. We evaluated our framework on three medical image segmentation tasks, demonstrating superior performance and generalizability across various modalities without the need for any retraining or finetuning. Overall, this method offers a practical and effective solution for few-shot medical image segmentation and holds significant potential as a valuable annotation tool in clinical applications.

Title: PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future

Authors: Guangyi Wang, Yuren Cai, Lijiang Li, Wei Peng, Songzhi Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08822
Pdf URL: https://arxiv.org/pdf/2408.08822
Copy Paste: [[2408.08822]] PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future(https://arxiv.org/abs/2408.08822)
Keywords: diffusion
Abstract: Diffusion Probabilistic Models (DPMs) have shown remarkable potential in image generation, but their sampling efficiency is hindered by the need for numerous denoising steps. Most existing solutions accelerate the sampling process by proposing fast ODE solvers. However, the inevitable discretization errors of the ODE solvers are significantly magnified when the number of function evaluations (NFE) is fewer. In this work, we propose PFDiff, a novel training-free and orthogonal timestep-skipping strategy, which enables existing fast ODE solvers to operate with fewer NFE. Based on two key observations: a significant similarity in the model's outputs at time step size that is not excessively large during the denoising process of existing ODE solvers, and a high resemblance between the denoising process and SGD. PFDiff, by employing gradient replacement from past time steps and foresight updates inspired by Nesterov momentum, rapidly updates intermediate states, thereby reducing unnecessary NFE while correcting for discretization errors inherent in first-order ODE solvers. Experimental results demonstrate that PFDiff exhibits flexible applicability across various pre-trained DPMs, particularly excelling in conditional DPMs and surpassing previous state-of-the-art training-free methods. For instance, using DDIM as a baseline, we achieved 16.46 FID (4 NFE) compared to 138.81 FID with DDIM on ImageNet 64x64 with classifier guidance, and 13.06 FID (10 NFE) on Stable Diffusion with 7.5 guidance scale.

Title: SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation

Authors: Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.08870
Pdf URL: https://arxiv.org/pdf/2408.08870
Copy Paste: [[2408.08870]] SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation(https://arxiv.org/abs/2408.08870)
Keywords: foundation model
Abstract: Image segmentation plays an important role in vision understanding. Recently, the emerging vision foundation models continuously achieved superior performance on various tasks. Following such success, in this paper, we prove that the Segment Anything Model 2 (SAM2) can be a strong encoder for U-shaped segmentation models. We propose a simple but effective framework, termed SAM2-UNet, for versatile image segmentation. Specifically, SAM2-UNet adopts the Hiera backbone of SAM2 as the encoder, while the decoder uses the classic U-shaped design. Additionally, adapters are inserted into the encoder to allow parameter-efficient fine-tuning. Preliminary experiments on various downstream tasks, such as camouflaged object detection, salient object detection, marine animal segmentation, mirror detection, and polyp segmentation, demonstrate that our SAM2-UNet can simply beat existing specialized state-of-the-art methods without bells and whistles. Project page: \url{this https URL}.

Title: xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2408.08872
Pdf URL: https://arxiv.org/pdf/2408.08872
Copy Paste: [[2408.08872]] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models(https://arxiv.org/abs/2408.08872)
Keywords: in-context
Abstract: This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.