2024-04-19

Title: Prompt-Driven Feature Diffusion for Open-World Semi-Supervised Learning

Authors: Marzi Heidari, Hanping Zhang, Yuhong Guo
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11795
Pdf URL: https://arxiv.org/pdf/2404.11795
Copy Paste: [[2404.11795]] Prompt-Driven Feature Diffusion for Open-World Semi-Supervised Learning(https://arxiv.org/abs/2404.11795)
Keywords: diffusion
Abstract: In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the non-availability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process can be discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within a semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show PDFD exhibits remarkable performance enhancements over many state-of-the-art existing methods.

Title: When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

Authors: Yiqun Xie, Zhihao Wang, Weiye Chen, Zhili Li, Xiaowei Jia, Yanhua Li, Ruichen Wang, Kangyang Chai, Ruohan Li, Sergii Skakun
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11797
Pdf URL: https://arxiv.org/pdf/2404.11797
Copy Paste: [[2404.11797]] When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery(https://arxiv.org/abs/2404.11797)
Keywords: self-supervised, foundation model
Abstract: Foundation models, i.e., very large deep learning models, have demonstrated impressive performances in various language and vision tasks that are otherwise difficult to reach using smaller-size models. The major success of GPT-type of language models is particularly exciting and raises expectations on the potential of foundation models in other domains including satellite remote sensing. In this context, great efforts have been made to build foundation models to test their capabilities in broader applications, and examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: The suitability of foundation models depend on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.

Title: Tailoring Generative Adversarial Networks for Smooth Airfoil Design

Authors: Joyjit Chattoraj, Jian Cheng Wong, Zhang Zexuan, Manna Dai, Xia Yingzhi, Li Jichao, Xu Xinxing, Ooi Chin Chun, Yang Feng, Dao My Ha, Liu Yong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.11816
Pdf URL: https://arxiv.org/pdf/2404.11816
Copy Paste: [[2404.11816]] Tailoring Generative Adversarial Networks for Smooth Airfoil Design(https://arxiv.org/abs/2404.11816)
Keywords: generative
Abstract: In the realm of aerospace design, achieving smooth curves is paramount, particularly when crafting objects such as airfoils. Generative Adversarial Network (GAN), a widely employed generative AI technique, has proven instrumental in synthesizing airfoil designs. However, a common limitation of GAN is the inherent lack of smoothness in the generated airfoil surfaces. To address this issue, we present a GAN model featuring a customized loss function built to produce seamlessly contoured airfoil designs. Additionally, our model demonstrates a substantial increase in design diversity compared to a conventional GAN augmented with a post-processing smoothing filter.

Title: Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement

Authors: Pushkar Shukla, Dhruv Srikanth, Lee Cohen, Matthew Turk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11819
Pdf URL: https://arxiv.org/pdf/2404.11819
Copy Paste: [[2404.11819]] Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement(https://arxiv.org/abs/2404.11819)
Keywords: generative
Abstract: We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that post-training, the decisions made by the model are less dependent on the sensitive attribute and our model better disentangles the relationship between sensitive attributes and classification variables.

Title: Hypergraph Self-supervised Learning with Sampling-efficient Signals

Authors: Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, Xuemin Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.11825
Pdf URL: https://arxiv.org/pdf/2404.11825
Copy Paste: [[2404.11825]] Hypergraph Self-supervised Learning with Sampling-efficient Signals(https://arxiv.org/abs/2404.11825)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) provides a promising alternative for representation learning on hypergraphs without costly labels. However, existing hypergraph SSL models are mostly based on contrastive methods with the instance-level discrimination strategy, suffering from two significant limitations: (1) They select negative samples arbitrarily, which is unreliable in deciding similar and dissimilar pairs, causing training bias. (2) They often require a large number of negative samples, resulting in expensive computational costs. To address the above issues, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised signals. Specifically, we introduce two sampling-free objectives leveraging the canonical correlation analysis as the node-level and group-level self-supervised signals. Additionally, we develop a novel hierarchical membership-level contrast objective motivated by the cascading overlap relationship in hypergraphs, which can further reduce membership sampling bias and improve the efficiency of sample utilization. Through comprehensive experiments on 7 real-world hypergraphs, we demonstrate the superiority of our approach over the state-of-the-art method in terms of both effectiveness and efficiency.

Title: From Image to Video, what do we need in multimodal LLMs?

Authors: Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, Zengchang Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11865
Pdf URL: https://arxiv.org/pdf/2404.11865
Copy Paste: [[2404.11865]] From Image to Video, what do we need in multimodal LLMs?(https://arxiv.org/abs/2404.11865)
Keywords: foundation model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, covering from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, the current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these methods.In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which utilizes a temporal adaptation plug-and-play structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but also do so with minimal instructional data and training resources. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.

Title: OPTiML: Dense Semantic Invariance Using Optimal Transport for Self-Supervised Medical Image Representation

Authors: Azad Singh, Vandan Gorade, Deepak Mishra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11868
Pdf URL: https://arxiv.org/pdf/2404.11868
Copy Paste: [[2404.11868]] OPTiML: Dense Semantic Invariance Using Optimal Transport for Self-Supervised Medical Image Representation(https://arxiv.org/abs/2404.11868)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite the promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations, which fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce a novel SSL framework OPTiML, employing optimal transport (OT), to capture the dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes the variance and covariance regularizations within OT framework to force the model focus on clinically relevant information while discarding less informative features. Through these, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.

Title: FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models

Authors: Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, Antoni B. Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11895
Pdf URL: https://arxiv.org/pdf/2404.11895
Copy Paste: [[2404.11895]] FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models(https://arxiv.org/abs/2404.11895)
Keywords: diffusion, generative
Abstract: Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.

Title: EdgeFusion: On-Device Text-to-Image Generation

Authors: Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11925
Pdf URL: https://arxiv.org/pdf/2404.11925
Copy Paste: [[2404.11925]] EdgeFusion: On-Device Text-to-Image Generation(https://arxiv.org/abs/2404.11925)
Keywords: diffusion, generative
Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

Title: LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights

Authors: Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim, Shinkook Choi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11936
Pdf URL: https://arxiv.org/pdf/2404.11936
Copy Paste: [[2404.11936]] LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights(https://arxiv.org/abs/2404.11936)
Keywords: diffusion, generative
Abstract: Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability.

Title: Sketch-guided Image Inpainting with Partial Discrete Diffusion Process

Authors: Nakul Sharma, Aditay Tripathi, Anirban Chakraborty, Anand Mishra
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11949
Pdf URL: https://arxiv.org/pdf/2404.11949
Copy Paste: [[2404.11949]] Sketch-guided Image Inpainting with Partial Discrete Diffusion Process(https://arxiv.org/abs/2404.11949)
Keywords: diffusion
Abstract: In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby, enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from the MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at https://github.com/vl2g/Sketch-Inpainting .

Title: The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models

Authors: Cheng Shi, Sibei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11957
Pdf URL: https://arxiv.org/pdf/2404.11957
Copy Paste: [[2404.11957]] The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models(https://arxiv.org/abs/2404.11957)
Keywords: foundation model
Abstract: Foundation models, pre-trained on a large amount of data have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, \textit{i.e.}, these foundation models fail to discern boundaries between individual objects. For the first time, we probe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose $\textbf{Zip}$ which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detecters using base annotations. Code is released at https://github.com/ChengShiest/Zip-Your-CLIP

Title: Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

Authors: Qiyuan Dai, Sibei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11998
Pdf URL: https://arxiv.org/pdf/2404.11998
Copy Paste: [[2404.11998]] Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation(https://arxiv.org/abs/2404.11998)
Keywords: foundation model
Abstract: Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet relying on cost-intensive mask annotations. Weakly supervised RIS thus learns from image-text pairs to pixel-level semantics, which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to the inevitable noise issues and challenges in excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noisy and excessive focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34%, 14.14%, and 6.97% across RefCOCO, RefCOCO+, and G-Ref, respectively.

Title: Sequential Compositional Generalization in Multimodal Models

Authors: Semih Yagcioglu, Osman Batur İnce, Aykut Erdem, Erkut Erdem, Desmond Elliott, Deniz Yuret
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12013
Pdf URL: https://arxiv.org/pdf/2404.12013
Copy Paste: [[2404.12013]] Sequential Compositional Generalization in Multimodal Models(https://arxiv.org/abs/2404.12013)
Keywords: generative
Abstract: The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using \textsc{CompAct} (\underline{Comp}ositional \underline{Act}ivities)\footnote{Project Page: \url{this http URL}}, a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

Title: S3R-Net: A Single-Stage Approach to Self-Supervised Shadow Removal

Authors: Nikolina Kubiak, Armin Mustafa, Graeme Phillipson, Stephen Jolly, Simon Hadfield
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2404.12103
Pdf URL: https://arxiv.org/pdf/2404.12103
Copy Paste: [[2404.12103]] S3R-Net: A Single-Stage Approach to Self-Supervised Shadow Removal(https://arxiv.org/abs/2404.12103)
Keywords: self-supervised
Abstract: In this paper we present S3R-Net, the Self-Supervised Shadow Removal Network. The two-branch WGAN model achieves self-supervision relying on the unify-and-adaptphenomenon - it unifies the style of the output data and infers its characteristics from a database of unaligned shadow-free reference images. This approach stands in contrast to the large body of supervised frameworks. S3R-Net also differentiates itself from the few existing self-supervised models operating in a cycle-consistent manner, as it is a non-cyclic, unidirectional solution. The proposed framework achieves comparable numerical scores to recent selfsupervised shadow removal models while exhibiting superior qualitative performance and keeping the computational cost low.

Title: StyleBooth: Image Style Editing with Multimodal Instruction

Authors: Zhen Han, Chaojie Mao, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12154
Pdf URL: https://arxiv.org/pdf/2404.12154
Copy Paste: [[2404.12154]] StyleBooth: Image Style Editing with Multimodal Instruction(https://arxiv.org/abs/2404.12154)
Keywords: diffusion
Abstract: Given an original image, image editing aims to generate an image that align with the provided instruction. The challenges are to accept multimodal inputs as instructions and a scarcity of high-quality training data, including crucial triplets of source/target image pairs and multimodal (text and image) instructions. In this paper, we focus on image style editing and present StyleBooth, a method that proposes a comprehensive framework for image editing and a feasible strategy for building a high-quality style editing dataset. We integrate encoded textual instruction and image exemplar as a unified condition for diffusion model, enabling the editing of original image following multimodal instructions. Furthermore, by iterative style-destyle tuning and editing and usability filtering, the StyleBooth dataset provides content-consistent stylized/plain image pairs in various categories of styles. To show the flexibility of StyleBooth, we conduct experiments on diverse tasks, such as text-based style editing, exemplar-based style editing and compositional style editing. The results demonstrate that the quality and variety of training data significantly enhance the ability to preserve content and improve the overall quality of generated images in editing tasks. Project page can be found at https://ali-vilab.github.io/stylebooth-page/.

Title: How to Benchmark Vision Foundation Models for Semantic Segmentation?

Authors: Tommie Kerssies, Daan de Geus, Gijs Dubbelman
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2404.12172
Pdf URL: https://arxiv.org/pdf/2404.12172
Copy Paste: [[2404.12172]] How to Benchmark Vision Foundation Models for Semantic Segmentation?(https://arxiv.org/abs/2404.12172)
Keywords: foundation model
Abstract: Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.

Title: Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Authors: Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12210
Pdf URL: https://arxiv.org/pdf/2404.12210
Copy Paste: [[2404.12210]] Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training(https://arxiv.org/abs/2404.12210)
Keywords: self-supervised
Abstract: Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question if the extremely simple ViTs' fine-tuning performance with a small-scale architecture can also benefit from this pre-training paradigm, which is considerably less studied yet in contrast to the well-established lightweight architecture design methodology with sophisticated components introduced. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with the contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation and also the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding is naturally a guide to choosing appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all the current SOTA lightweight CPU-realtime trackers.

Title: Blind Localization and Clustering of Anomalies in Textures

Authors: Andrei-Timotei Ardelean, Tim Weyrich
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12246
Pdf URL: https://arxiv.org/pdf/2404.12246
Copy Paste: [[2404.12246]] Blind Localization and Clustering of Anomalies in Textures(https://arxiv.org/abs/2404.12246)
Keywords: anomaly
Abstract: Anomaly detection and localization in images is a growing field in computer vision. In this area, a seemingly understudied problem is anomaly clustering, i.e., identifying and grouping different types of anomalies in a fully unsupervised manner. In this work, we propose a novel method for clustering anomalies in largely stationary images (textures) in a blind setting. That is, the input consists of normal and anomalous images without distinction and without labels. What contributes to the difficulty of the task is that anomalous regions are often small and may present only subtle changes in appearance, which can be easily overshadowed by the genuine variance in the texture. Moreover, each anomaly type may have a complex appearance distribution. We introduce a novel scheme for solving this task using a combination of blind anomaly localization and contrastive learning. By identifying the anomalous regions with high fidelity, we can restrict our focus to those regions of interest; then, contrastive learning is employed to increase the separability of different anomaly types and reduce the intra-class variation. Our experiments show that the proposed solution yields significantly better results compared to prior work, setting a new state of the art. Project page: https://reality.tf.fau.de/pub/ardelean2024blind.html.

Title: Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models

Authors: Israel A. Laurensi, Alceu de Souza Britto Jr., Jean Paul Barddal, Alessandro Lameiras Koerich
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12260
Pdf URL: https://arxiv.org/pdf/2404.12260
Copy Paste: [[2404.12260]] Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models(https://arxiv.org/abs/2404.12260)
Keywords: generative
Abstract: Facial expression recognition is a pivotal component in machine learning, facilitating various applications. However, convolutional neural networks (CNNs) are often plagued by catastrophic forgetting, impeding their adaptability. The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. Moreover, ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. This dual approach enables CNNs to retain past knowledge while learning new tasks, enhancing their performance in emotion recognition. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset while making the CNN retain previously learned knowledge.

Title: Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoder

Authors: Sheikh Waqas Akhtar
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2404.12267
Pdf URL: https://arxiv.org/pdf/2404.12267
Copy Paste: [[2404.12267]] Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoder(https://arxiv.org/abs/2404.12267)
Keywords: generative
Abstract: Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics integrated generative model. To this end, we use variational-autoencoder as a generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics as well as the trainable data-driven components using planar normalizng flow. Normalizng flow based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of generative model against noise injected in the model, we propose a modification in the encoder part of the normalizing flow based VAE. We designed the encoder to incorporate scaled dot product attention based contextual information in the noisy latent vector which will mitigate the adverse effect of noise in the latent vector and make the model more robust. We empirically evaluated our models on human locomotion dataset [33] and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected in the model.

Title: Guided Discrete Diffusion for Electronic Health Record Generation

Authors: Zixiang Chen, Jun Han, Yongqian Li, Yiwen Kou, Eran Halperin, Robert E. Tillman, Quanquan Gu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.12314
Pdf URL: https://arxiv.org/pdf/2404.12314
Copy Paste: [[2404.12314]] Guided Discrete Diffusion for Electronic Health Record Generation(https://arxiv.org/abs/2404.12314)
Keywords: diffusion, generative
Abstract: Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.

Title: Customizing Text-to-Image Diffusion with Camera Viewpoint Control

Authors: Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12333
Pdf URL: https://arxiv.org/pdf/2404.12333
Copy Paste: [[2404.12333]] Customizing Text-to-Image Diffusion with Camera Viewpoint Control(https://arxiv.org/abs/2404.12333)
Keywords: diffusion
Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of the new concept in novel contexts. However, such methods lack accurate camera view control w.r.t the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of camera viewpoint for model customization. This allows us to modify object properties amongst various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and the object's camera pose.

Title: Measuring Feature Dependency of Neural Networks by Collapsing Feature Dimensions in the Data Manifold

Authors: Yinzhu Jin, Matthew B. Dwyer, P. Thomas Fletcher
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2404.12341
Pdf URL: https://arxiv.org/pdf/2404.12341
Copy Paste: [[2404.12341]] Measuring Feature Dependency of Neural Networks by Collapsing Feature Dimensions in the Data Manifold(https://arxiv.org/abs/2404.12341)
Keywords: generative
Abstract: This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significantly harm its performance. A targeted feature is "removed" by collapsing the dimension in the data distribution that corresponds to that feature. We perform this by moving data points along the feature dimension to a baseline feature value while staying on the data manifold, as estimated by a deep generative model. Then we observe how the model's performance changes on the modified test data set, with the target feature dimension removed. We test our method on deep neural network models trained on synthetic image data with known ground truth, an Alzheimer's disease prediction task using MRI and hippocampus segmentations from the OASIS-3 dataset, and a cell nuclei classification task using the Lizard dataset.

Title: Large Language Models in Targeted Sentiment Analysis

Authors: Nicolay Rusnachenko, Anton Golubev, Natalia Loukachevitch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12342
Pdf URL: https://arxiv.org/pdf/2404.12342
Copy Paste: [[2404.12342]] Large Language Models in Targeted Sentiment Analysis(https://arxiv.org/abs/2404.12342)
Keywords: generative
Abstract: In this paper we investigate the use of decoder-based generative transformers for extracting sentiment towards the named entities in Russian news articles. We study sentiment analysis capabilities of instruction-tuned large language models (LLMs). We consider the dataset of RuSentNE-2023 in our study. The first group of experiments was aimed at the evaluation of zero-shot capabilities of LLMs with closed and open transparencies. The second covers the fine-tuning of Flan-T5 using the "chain-of-thought" (CoT) three-hop reasoning framework (THoR). We found that the results of the zero-shot approaches are similar to the results achieved by baseline fine-tuned encoder-based transformers (BERT-base). Reasoning capabilities of the fine-tuned Flan-T5 models with THoR achieve at least 5% increment with the base-size model compared to the results of the zero-shot experiment. The best results of sentiment analysis on RuSentNE-2023 were achieved by fine-tuned Flan-T5-xl, which surpassed the results of previous state-of-the-art transformer-based classifiers. Our CoT application framework is publicly available: https://github.com/nicolay-r/Reasoning-for-Sentiment-Analysis-Framework

Title: AniClipart: Clipart Animation with Text-to-Video Priors

Authors: Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2404.12347
Pdf URL: https://arxiv.org/pdf/2404.12347
Copy Paste: [[2404.12347]] AniClipart: Clipart Animation with Text-to-Video Priors(https://arxiv.org/abs/2404.12347)
Keywords: diffusion
Abstract: Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define B\'{e}zier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

Title: Point-In-Context: Understanding Point Cloud via In-Context Learning

Authors: Mengyuan Liu, Zhongbin Fang, Xia Li, Joachim M. Buhmann, Xiangtai Li, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12352
Pdf URL: https://arxiv.org/pdf/2404.12352
Copy Paste: [[2404.12352]] Point-In-Context: Understanding Point Cloud via In-Context Learning(https://arxiv.org/abs/2404.12352)
Keywords: in-context
Abstract: With the emergence of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application in 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning label points with XYZ coordinates for each category; the final prediction is then chosen based on the label point closest to the predictions. To break the limitation by the fixed label-coordinate assignment, which has poor generalization upon novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), targeting improving dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework so that other tasks or datasets can be seamlessly introduced into our PIC through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multi-datasets. Our PIC-S is capable of generalizing unseen datasets and performing novel part segmentation by customizing prompts.

Title: Towards a Foundation Model for Partial Differential Equation: Multi-Operator Learning and Extrapolation

Authors: Jingmin Sun, Yuxuan Liu, Zecheng Zhang, Hayden Schaeffer
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2404.12355
Pdf URL: https://arxiv.org/pdf/2404.12355
Copy Paste: [[2404.12355]] Towards a Foundation Model for Partial Differential Equation: Multi-Operator Learning and Extrapolation(https://arxiv.org/abs/2404.12355)
Keywords: foundation model
Abstract: Foundation models, such as large language models, have demonstrated success in addressing various language and image processing tasks. In this work, we introduce a multi-modal foundation model for scientific problems, named PROSE-PDE. Our model, designed for bi-modality to bi-modality learning, is a multi-operator learning approach which can predict future states of spatiotemporal systems while concurrently learning the underlying governing equations of the physical system. Specifically, we focus on multi-operator learning by training distinct one-dimensional time-dependent nonlinear constant coefficient partial differential equations, with potential applications to many physical applications including physics, geology, and biology. More importantly, we provide three extrapolation studies to demonstrate that PROSE-PDE can generalize physical features through the robust training of multiple operators and that the proposed model can extrapolate to predict PDE solutions whose models or data were unseen during the training. Furthermore, we show through systematic numerical experiments that the utilization of the symbolic modality in our model effectively resolves the well-posedness problems with training multiple operators and thus enhances our model's predictive capabilities.

Title: From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Authors: Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.12358
Pdf URL: https://arxiv.org/pdf/2404.12358
Copy Paste: [[2404.12358]] From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function(https://arxiv.org/abs/2404.12358)
Keywords: generative
Abstract: Reinforcement Learning From Human Feedback (RLHF) has been a critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference, first we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-tun dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.

Title: Inverse Neural Rendering for Explainable Multi-Object Tracking

Authors: Julian Ost, Tanushree Banerjee, Mario Bijelic, Felix Heide
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2404.12359
Pdf URL: https://arxiv.org/pdf/2404.12359
Copy Paste: [[2404.12359]] Inverse Neural Rendering for Explainable Multi-Object Tracking(https://arxiv.org/abs/2404.12359)
Keywords: generative
Abstract: Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is true especially when attempting to predict 3D information based on 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem, by optimizing via a differentiable rendering pipeline over the latent space of pre-trained 3D object representations and retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. We investigate not only an alternate take on tracking but our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both these datasets are completely unseen to our method and do not require fine-tuning. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.

Title: MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Authors: Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, Zuozhu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12372
Pdf URL: https://arxiv.org/pdf/2404.12372
Copy Paste: [[2404.12372]] MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale(https://arxiv.org/abs/2404.12372)
Keywords: generative
Abstract: Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamlining data preparation and build new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

Title: Lazy Diffusion Transformer for Interactive Image Editing

Authors: Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2404.12382
Pdf URL: https://arxiv.org/pdf/2404.12382
Copy Paste: [[2404.12382]] Lazy Diffusion Transformer for Interactive Image Editing(https://arxiv.org/abs/2404.12382)
Keywords: diffusion
Abstract: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.

Title: G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis

Authors: Yufei Ye, Abhinav Gupta, Kris Kitani, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12383
Pdf URL: https://arxiv.org/pdf/2404.12383
Copy Paste: [[2404.12383]] G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis(https://arxiv.org/abs/2404.12383)
Keywords: diffusion, generative
Abstract: We propose G-HOP, a denoising diffusion based generative prior for hand-object interactions that allows modeling both the 3D object and a human hand, conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution, we represent the human hand via a skeletal distance field to obtain a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as generic guidance to facilitate other tasks like reconstruction from interaction clip and human grasp synthesis. We believe that our model, trained by aggregating seven diverse real-world interaction datasets spanning across 155 categories, represents a first approach that allows jointly generating both hand and object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis, outperforming current task-specific baselines. Project website: https://judyye.github.io/ghop-www

Title: SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

Authors: Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12386
Pdf URL: https://arxiv.org/pdf/2404.12386
Copy Paste: [[2404.12386]] SOHES: Self-supervised Open-world Hierarchical Entity Segmentation(https://arxiv.org/abs/2404.12386)
Keywords: self-supervised
Abstract: Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES.github.io.

Title: VideoGigaGAN: Towards Detail-rich Video Super-Resolution

Authors: Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12388
Pdf URL: https://arxiv.org/pdf/2404.12388
Copy Paste: [[2404.12388]] VideoGigaGAN: Towards Detail-rich Video Super-Resolution(https://arxiv.org/abs/2404.12388)
Keywords: generative
Abstract: Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with $8\times$ super-resolution.

Title: Moving Object Segmentation: All You Need Is SAM (and Flow)

Authors: Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12389
Pdf URL: https://arxiv.org/pdf/2404.12389
Copy Paste: [[2404.12389]] Moving Object Segmentation: All You Need Is SAM (and Flow)(https://arxiv.org/abs/2404.12389)
Keywords: self-supervised
Abstract: The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful,and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.

Title: On the Content Bias in Fréchet Video Distance

Authors: Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12391
Pdf URL: https://arxiv.org/pdf/2404.12391
Copy Paste: [[2404.12391]] On the Content Bias in Fréchet Video Distance(https://arxiv.org/abs/2404.12391)
Keywords: self-supervised
Abstract: Fr\'echet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to conflict with human perception occasionally. In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD increases only slightly with large temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions, one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias towards the quality of individual frames. We further observe that the bias can be attributed to the features extracted from a supervised video classifier trained on the content-biased dataset. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis.