2024-01-18

Title: Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision

Authors: Yi Cao, Swetava Ganguli, Vipul Pandey
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2401.08581
Pdf URL: https://arxiv.org/pdf/2401.08581
Copy Paste: [[2401.08581]] Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision(https://arxiv.org/abs/2401.08581)
Keywords: self-supervised
Abstract: There exists a correlation between geospatial activity temporal patterns and type of land use. A novel self-supervised approach is proposed to stratify landscape based on mobility activity time series. First, the time series signal is transformed to the frequency domain and then compressed into task-agnostic temporal embeddings by a contractive autoencoder, which preserves cyclic temporal patterns observed in time series. The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling of downstream geospatial tasks using deep semantic segmentation. Experiments show that temporal embeddings are semantically meaningful representations of time series data and are effective across different tasks such as classifying residential area and commercial areas. Temporal embeddings transform sequential, spatiotemporal motion trajectory data into semantically meaningful image-like tensor representations that can be combined (multimodal fusion) with other data modalities that are or can be transformed into image-like tensor representations (for e.g., RBG imagery, graph embeddings of road networks, passively collected imagery like SAR, etc.) to facilitate multimodal learning in geospatial computer vision. Multimodal computer vision is critical for training machine learning models for geospatial feature detection to keep a geospatial mapping service up-to-date in real-time and can significantly improve user experience and above all, user safety.

Title: Improved Pothole Detection Using YOLOv7 and ESRGAN

Authors: Nirmal Kumar Rout, Gyanateet Dutta, Varun Sinha, Arghadeep Dey, Subhrangshu Mukherjee, Gopal Gupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08588
Pdf URL: https://arxiv.org/pdf/2401.08588
Copy Paste: [[2401.08588]] Improved Pothole Detection Using YOLOv7 and ESRGAN(https://arxiv.org/abs/2401.08588)
Keywords: generative
Abstract: Potholes are common road hazards that is causing damage to vehicles and posing a safety risk to drivers. The introduction of Convolutional Neural Networks (CNNs) is widely used in the industry for object detection based on Deep Learning methods and has achieved significant progress in hardware improvement and software implementations. In this paper, a unique better algorithm is proposed to warrant the use of low-resolution cameras or low-resolution images and video feed for automatic pothole detection using Super Resolution (SR) through Super Resolution Generative Adversarial Networks (SRGANs). Then we have proceeded to establish a baseline pothole detection performance on low quality and high quality dashcam images using a You Only Look Once (YOLO) network, namely the YOLOv7 network. We then have illustrated and examined the speed and accuracy gained above the benchmark after having upscaling implementation on the low quality images.

Title: Online Anomaly Detection over Live Social Video Streaming

Authors: Chengkun He, Xiangmin Zhou, Chen Wang, Iqbal Gondal, Jie Shao, Xun Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08615
Pdf URL: https://arxiv.org/pdf/2401.08615
Copy Paste: [[2401.08615]] Online Anomaly Detection over Live Social Video Streaming(https://arxiv.org/abs/2401.08615)
Keywords: anomaly
Abstract: Social video anomaly is an observation in video streams that does not conform to a common pattern of dataset's behaviour. Social video anomaly detection plays a critical role in applications from e-commerce to e-learning. Traditionally, anomaly detection techniques are applied to find anomalies in video broadcasting. However, they neglect the live social video streams which contain interactive talk, speech, or lecture with audience. In this paper, we propose a generic framework for effectively online detecting Anomalies Over social Video LIve Streaming (AOVLIS). Specifically, we propose a novel deep neural network model called Coupling Long Short-Term Memory (CLSTM) that adaptively captures the history behaviours of the presenters and audience, and their mutual interactions to predict their behaviour at next time point over streams. Then we well integrate the CLSTM with a decoder layer, and propose a new reconstruction error-based scoring function $RE_{IA}$ to calculate the anomaly score of each video segment for anomaly detection. After that, we propose a novel model update scheme that incrementally maintains CLSTM and decoder. Moreover, we design a novel upper bound and ADaptive Optimisation Strategy (ADOS) for improving the efficiency of our solution. Extensive experiments are conducted to prove the superiority of AOVLIS.

Title: One-Step Diffusion Distillation via Deep Equilibrium Models

Authors: Zhengyang Geng, Ashwini Pokle, J. Zico Kolter
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2401.08639
Pdf URL: https://arxiv.org/pdf/2401.08639
Copy Paste: [[2401.08639]] One-Step Diffusion Distillation via Deep Equilibrium Models(https://arxiv.org/abs/2401.08639)
Keywords: diffusion, generative
Abstract: Diffusion models excel at producing high-quality samples but naively require hundreds of iterations, prompting multiple attempts to distill the generation process into a faster network. However, many existing approaches suffer from a variety of challenges: the process for distillation training can be complex, often requiring multiple training stages, and the resulting models perform poorly when utilized in single-step generative applications. In this paper, we introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image. Of particular importance to our approach is to leverage a new Deep Equilibrium (DEQ) model as the distilled architecture: the Generative Equilibrium Transformer (GET). Our method enables fully offline training with just noise/image pairs from the diffusion model while achieving superior performance compared to existing one-step methods on comparable training budgets. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores while striking a critical balance of computational cost and image quality. Code, checkpoints, and datasets are available.

Title: SAiD: Speech-driven Blendshape Facial Animation with Diffusion

Authors: Inkyu Park, Jaewoong Cho
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2401.08655
Pdf URL: https://arxiv.org/pdf/2401.08655
Copy Paste: [[2401.08655]] SAiD: Speech-driven Blendshape Facial Animation with Diffusion(https://arxiv.org/abs/2401.08655)
Keywords: diffusion
Abstract: Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets despite extensive research. Most prior works, typically focused on learning regression models on a small dataset using the method of least squares, encounter difficulties generating diverse lip movements from speech and require substantial effort in refining the generated outputs. To address these issues, we propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance in lip synchronization to baselines, ensures more diverse lip movements, and streamlines the animation editing process.

Title: Attention Modules Improve Modern Image-Level Anomaly Detection: A DifferNet Case Study

Authors: André Luiz B. Vieira e Silva, Francisco Simões, Danny Kowerko, Tobias Schlosser, Felipe Battisti, Veronica Teichrieb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08686
Pdf URL: https://arxiv.org/pdf/2401.08686
Copy Paste: [[2401.08686]] Attention Modules Improve Modern Image-Level Anomaly Detection: A DifferNet Case Study(https://arxiv.org/abs/2401.08686)
Keywords: anomaly
Abstract: Within (semi-)automated visual inspection, learning-based approaches for assessing visual defects, including deep neural networks, enable the processing of otherwise small defect patterns in pixel size on high-resolution imagery. The emergence of these often rarely occurring defect patterns explains the general need for labeled data corpora. To not only alleviate this issue but to furthermore advance the current state of the art in unsupervised visual inspection, this contribution proposes a DifferNet-based solution enhanced with attention modules utilizing SENet and CBAM as backbone - AttentDifferNet - to improve the detection and classification capabilities on three different visual inspection and anomaly detection datasets: MVTec AD, InsPLAD-fault, and Semiconductor Wafer. In comparison to the current state of the art, it is shown that AttentDifferNet achieves improved results, which are, in turn, highlighted throughout our quantitative as well as qualitative evaluation, indicated by a general improvement in AUC of 94.34 vs. 92.46, 96.67 vs. 94.69, and 90.20 vs. 88.74%. As our variants to AttentDifferNet show great prospects in the context of currently investigated approaches, a baseline is formulated, emphasizing the importance of attention for anomaly detection.

Title: NODI: Out-Of-Distribution Detection with Noise from Diffusion

Authors: Jingqiu Zhou, Aojun Zou, Hongshen Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2401.08689
Pdf URL: https://arxiv.org/pdf/2401.08689
Copy Paste: [[2401.08689]] NODI: Out-Of-Distribution Detection with Noise from Diffusion(https://arxiv.org/abs/2401.08689)
Keywords: diffusion
Abstract: Out-of-distribution (OOD) detection is a crucial part of deploying machine learning models safely. It has been extensively studied with a plethora of methods developed in the literature. This problem is tackled with an OOD score computation, however, previous methods compute the OOD scores with limited usage of the in-distribution dataset. For instance, the OOD scores are computed with information from a small portion of the in-distribution data. Furthermore, these methods encode images with a neural image encoder. The robustness of these methods is rarely checked with respect to image encoders of different training methods and architectures. In this work, we introduce the diffusion process into the OOD task. The diffusion model integrates information on the whole training set into the predicted noise vectors. What's more, we deduce a closed-form solution for the noise vector (stable point). Then the noise vector is converted into our OOD score, we test both the deep model predicted noise vector and the closed-form noise vector on the OOD benchmarks \cite{openood}. Our method outperforms previous OOD methods across all types of image encoders (Table. \ref{main}). A $3.5\%$ performance gain is achieved with the MAE-based image encoder. Moreover, we studied the robustness of OOD methods by applying different types of image encoders. Some OOD methods failed to generalize well when switching image encoders from ResNet to Vision Transformers, our method performs exhibits good robustness with all the image encoders.

Title: Contrastive Learning with Negative Sampling Correction

Authors: Lu Wang, Chao Du, Pu Zhao, Chuan Luo, Zhangchi Zhu, Bo Qiao, Wei Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.08690
Pdf URL: https://arxiv.org/pdf/2401.08690
Copy Paste: [[2401.08690]] Contrastive Learning with Negative Sampling Correction(https://arxiv.org/abs/2401.08690)
Keywords: self-supervised
Abstract: As one of the most effective self-supervised representation learning methods, contrastive learning (CL) relies on multiple negative pairs to contrast against each positive pair. In the standard practice of contrastive learning, data augmentation methods are utilized to generate both positive and negative pairs. While existing works have been focusing on improving the positive sampling, the negative sampling process is often overlooked. In fact, the generated negative samples are often polluted by positive samples, which leads to a biased loss and performance degradation. To correct the negative sampling bias, we propose a novel contrastive learning method named Positive-Unlabeled Contrastive Learning (PUCL). PUCL treats the generated negative samples as unlabeled samples and uses information from positive samples to correct bias in contrastive loss. We prove that the corrected loss used in PUCL only incurs a negligible bias compared to the unbiased contrastive loss. PUCL can be applied to general contrastive learning problems and outperforms state-of-the-art methods on various image and graph classification tasks. The code of PUCL is in the supplementary file.

Title: Unsupervised Pre-Training for 3D Leaf Instance Segmentation

Authors: Gianmarco Roggiolani, Federico Magistri, Tiziano Guadagnino, Jens Behley, Cyrill Stachniss
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08720
Pdf URL: https://arxiv.org/pdf/2401.08720
Copy Paste: [[2401.08720]] Unsupervised Pre-Training for 3D Leaf Instance Segmentation(https://arxiv.org/abs/2401.08720)
Keywords: self-supervised
Abstract: Crops for food, feed, fiber, and fuel are key natural resources for our society. Monitoring plants and measuring their traits is an important task in agriculture often referred to as plant phenotyping. Traditionally, this task is done manually, which is time- and labor-intensive. Robots can automate phenotyping providing reproducible and high-frequency measurements. Today's perception systems use deep learning to interpret these measurements, but require a substantial amount of annotated data to work well. Obtaining such labels is challenging as it often requires background knowledge on the side of the labelers. This paper addresses the problem of reducing the labeling effort required to perform leaf instance segmentation on 3D point clouds, which is a first step toward phenotyping in 3D. Separating all leaves allows us to count them and compute relevant traits as their areas, lengths, and widths. We propose a novel self-supervised task-specific pre-training approach to initialize the backbone of a network for leaf instance segmentation. We also introduce a novel automatic postprocessing that considers the difficulty of correctly segmenting the points close to the stem, where all the leaves petiole overlap. The experiments presented in this paper suggest that our approach boosts the performance over all the investigated scenarios. We also evaluate the embeddings to assess the quality of the fully unsupervised approach and see a higher performance of our domain-specific postprocessing.

Title: Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks

Authors: Chenyu Zhang, Lanjun Wang, Anan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08725
Pdf URL: https://arxiv.org/pdf/2401.08725
Copy Paste: [[2401.08725]] Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks(https://arxiv.org/abs/2401.08725)
Keywords: diffusion
Abstract: Recent developments in text-to-image models, particularly Stable Diffusion, have marked significant achievements in various applications. With these advancements, there are growing safety concerns about the vulnerability of the model that malicious entities exploit to generate targeted harmful images. However, the existing methods in the vulnerability of the model mainly evaluate the alignment between the prompt and generated images, but fall short in revealing the vulnerability associated with targeted image generation. In this study, we formulate the problem of targeted adversarial attack on Stable Diffusion and propose a framework to generate adversarial prompts. Specifically, we design a gradient-based embedding optimization method to craft reliable adversarial prompts that guide stable diffusion to generate specific images. Furthermore, after obtaining successful adversarial prompts, we reveal the mechanisms that cause the vulnerability of the model. Extensive experiments on two targeted attack tasks demonstrate the effectiveness of our method in targeted attacks. The code can be obtained in https://github.com/datar001/Revealing-Vulnerabilities-in-Stable-Diffusion-via-Targeted-Attacks.

Title: SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Authors: Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2401.08740
Pdf URL: https://arxiv.org/pdf/2401.08740
Copy Paste: [[2401.08740]] SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers(https://arxiv.org/abs/2401.08740)
Keywords: diffusion, generative
Abstract: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06.

Title: Fixed Point Diffusion Models

Authors: Xingjian Bai, Luke Melas-Kyriazi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2401.08741
Pdf URL: https://arxiv.org/pdf/2401.08741
Copy Paste: [[2401.08741]] Fixed Point Diffusion Models(https://arxiv.org/abs/2401.08741)
Keywords: diffusion, generative
Abstract: We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model, transforming the diffusion process into a sequence of closely-related fixed point problems. Combined with a new stochastic training method, this approach significantly reduces model size, reduces memory usage, and accelerates training. Moreover, it enables the development of two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments with state-of-the-art models on ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model, FPDM contains 87% fewer parameters, consumes 60% less memory during training, and improves image generation quality in situations where sampling computation or time is limited. Our code and pretrained models are available at https://lukemelas.github.io/fixed-point-diffusion-models.

Title: HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

Authors: Huanjun Kong, Songyang Zhang, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.08772
Pdf URL: https://arxiv.org/pdf/2401.08772
Copy Paste: [[2401.08772]] HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance(https://arxiv.org/abs/2401.08772)
Keywords: in-context
Abstract: In this work, we present HuixiangDou, a technical assistant powered by Large Language Models (LLM). This system is designed to assist algorithm developers by providing insightful responses to questions related to open-source algorithm projects, such as computer vision and deep learning projects from OpenMMLab. We further explore the integration of this assistant into the group chats of instant messaging (IM) tools such as WeChat and Lark. Through several iterative improvements and trials, we have developed a sophisticated technical chat assistant capable of effectively answering users' technical questions without causing message flooding. This paper's contributions include: 1) Designing an algorithm pipeline specifically for group chat scenarios; 2) Verifying the reliable performance of text2vec in task rejection; 3) Identifying three critical requirements for LLMs in technical-assistant-like products, namely scoring ability, In-Context Learning (ICL), and Long Context. We have made the software and source code available at https://github.com/internlm/huixiangdou to aid in future research and application. HuixiangDou is applicable to any group chat within IM tools.

Title: Segment Anything Model Can Not Segment Anything: Assessing AI Foundation Model's Generalizability in Permafrost Mapping

Authors: Wenwen Li, Chia-Yu Hsu, Sizhe Wang, Yezhou Yang, Hyunho Lee, Anna Liljedahl, Chandi Witharana, Yili Yang, Brendan M. Rogers, Samantha T. Arundel, Matthew B. Jones, Kenton McHenry, Patricia Solis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08787
Pdf URL: https://arxiv.org/pdf/2401.08787
Copy Paste: [[2401.08787]] Segment Anything Model Can Not Segment Anything: Assessing AI Foundation Model's Generalizability in Permafrost Mapping(https://arxiv.org/abs/2401.08787)
Keywords: foundation model
Abstract: This paper assesses trending AI foundation models, especially emerging computer vision foundation models and their performance in natural landscape feature segmentation. While the term foundation model has quickly garnered interest from the geospatial domain, its definition remains vague. Hence, this paper will first introduce AI foundation models and their defining characteristics. Built upon the tremendous success achieved by Large Language Models (LLMs) as the foundation models for language tasks, this paper discusses the challenges of building foundation models for geospatial artificial intelligence (GeoAI) vision tasks. To evaluate the performance of large AI vision models, especially Meta's Segment Anything Model (SAM), we implemented different instance segmentation pipelines that minimize the changes to SAM to leverage its power as a foundation model. A series of prompt strategies was developed to test SAM's performance regarding its theoretical upper bound of predictive accuracy, zero-shot performance, and domain adaptability through fine-tuning. The analysis used two permafrost feature datasets, ice-wedge polygons and retrogressive thaw slumps because (1) these landform features are more challenging to segment than manmade features due to their complicated formation mechanisms, diverse forms, and vague boundaries; (2) their presence and changes are important indicators for Arctic warming and climate change. The results show that although promising, SAM still has room for improvement to support AI-augmented terrain mapping. The spatial and domain generalizability of this finding is further validated using a more general dataset EuroCrop for agricultural field mapping. Finally, we discuss future research directions that strengthen SAM's applicability in challenging geospatial domains.

Title: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

Authors: Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2401.08815
Pdf URL: https://arxiv.org/pdf/2401.08815
Copy Paste: [[2401.08815]] Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive(https://arxiv.org/abs/2401.08815)
Keywords: diffusion
Abstract: Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).

Title: Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization

Authors: Qi Bi, Wei Ji, Jingjun Yi, Haolan Zhan, Gui-Song Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08860
Pdf URL: https://arxiv.org/pdf/2401.08860
Copy Paste: [[2401.08860]] Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization(https://arxiv.org/abs/2401.08860)
Keywords: self-supervised
Abstract: High-quality annotation of fine-grained visual categories demands great expert knowledge, which is taxing and time consuming. Alternatively, learning fine-grained visual representation from enormous unlabeled images (e.g., species, brands) by self-supervised learning becomes a feasible solution. However, recent researches find that existing self-supervised learning methods are less qualified to represent fine-grained categories. The bottleneck lies in that the pre-text representation is built from every patch-wise embedding, while fine-grained categories are only determined by several key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle the challenge. Our key idea is to consider the importance of each image patch in determining the fine-grained pre-text representation by multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, the multi-instance knowledge distillation is implemented on both the region/image crop pairs from the teacher and student net, and the region-image crops inside the teacher / student net, which we term as intra-level multi-instance distillation and inter-level multi-instance distillation. Extensive experiments on CUB-200-2011, Stanford Cars and FGVC Aircraft show that the proposed method outperforms the contemporary method by upto 10.14% and existing state-of-the-art self-supervised learning approaches by upto 19.78% on both top-1 accuracy and Rank-1 retrieval metric.

Title: Robust Localization of Key Fob Using Channel Impulse Response of Ultra Wide Band Sensors for Keyless Entry Systems

Authors: Abhiram Kolli, Filippo Casamassima, Horst Possegger, Horst Bischof
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2401.08863
Pdf URL: https://arxiv.org/pdf/2401.08863
Copy Paste: [[2401.08863]] Robust Localization of Key Fob Using Channel Impulse Response of Ultra Wide Band Sensors for Keyless Entry Systems(https://arxiv.org/abs/2401.08863)
Keywords: self-supervised
Abstract: Using neural networks for localization of key fob within and surrounding a car as a security feature for keyless entry is fast emerging. In this paper we study: 1) the performance of pre-computed features of neural networks based UWB (ultra wide band) localization classification forming the baseline of our experiments. 2) Investigate the inherent robustness of various neural networks; therefore, we include the study of robustness of the adversarial examples without any adversarial training in this work. 3) Propose a multi-head self-supervised neural network architecture which outperforms the baseline neural networks without any adversarial training. The model's performance improved by 67% at certain ranges of adversarial magnitude for fast gradient sign method and 37% each for basic iterative method and projected gradient descent method.

Title: 3D Human Pose Analysis via Diffusion Synthesis

Authors: Haorui Ji, Hongdong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2401.08930
Pdf URL: https://arxiv.org/pdf/2401.08930
Copy Paste: [[2401.08930]] 3D Human Pose Analysis via Diffusion Synthesis(https://arxiv.org/abs/2401.08930)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated remarkable success in generative modeling. In this paper, we propose PADS (Pose Analysis by Diffusion Synthesis), a novel framework designed to address various challenges in 3D human pose analysis through a unified pipeline. Central to PADS are two distinctive strategies: i) learning a task-agnostic pose prior using a diffusion synthesis process to effectively capture the kinematic constraints in human pose data, and ii) unifying multiple pose analysis tasks like estimation, completion, denoising, etc, as instances of inverse problems. The learned pose prior will be treated as a regularization imposing on task-specific constraints, guiding the optimization process through a series of conditional denoising steps. PADS represents the first diffusion-based framework for tackling general 3D human pose analysis within the inverse problem framework. Its performance has been validated on different benchmarks, signaling the adaptability and robustness of this pipeline.

Title: COCO is "ALL'' You Need for Visual Instruction Fine-tuning

Authors: Xiaotian Han, Yiqi Wang, Bohan Zhai, Quanzeng You, Hongxia Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08968
Pdf URL: https://arxiv.org/pdf/2401.08968
Copy Paste: [[2401.08968]] COCO is "ALL'' You Need for Visual Instruction Fine-tuning(https://arxiv.org/abs/2401.08968)
Keywords: generative
Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions. High-quality and diversified instruction following data is the key to this fine-tuning process. Recent studies propose to construct visual IFT datasets through a multifaceted approach: transforming existing datasets with rule-based templates, employing GPT-4 for rewriting annotations, and utilizing GPT-4V for visual dataset pseudo-labeling. LLaVA-1.5 adopted similar approach and construct LLaVA-mix-665k, which is one of the simplest, most widely used, yet most effective IFT datasets today. Notably, when properly fine-tuned with this dataset, MLLMs can achieve state-of-the-art performance on several benchmarks. However, we noticed that models trained with this dataset often struggle to follow user instructions properly in multi-round dialog. In addition, tradition caption and VQA evaluation benchmarks, with their closed-form evaluation structure, are not fully equipped to assess the capabilities of modern open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k dataset, but may be a potential issue in all IFT datasets constructed from image captioning or VQA sources, though the extent of this issue may vary. We argue that datasets with diverse and high-quality detailed instruction following annotations are essential and adequate for MLLMs IFT. In this work, we establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions. Our experiments show that when fine-tuned with out proposed dataset, MLLMs achieve better performance on open-ended evaluation benchmarks in both single-round and multi-round dialog setting.

Title: Hearing Loss Detection from Facial Expressions in One-on-one Conversations

Authors: Yufeng Yin, Ishwarya Ananthabhotla, Vamsi Krishna Ithapu, Stavros Petridis, Yu-Hsiang Wu, Christi Miller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.08972
Pdf URL: https://arxiv.org/pdf/2401.08972
Copy Paste: [[2401.08972]] Hearing Loss Detection from Facial Expressions in One-on-one Conversations(https://arxiv.org/abs/2401.08972)
Keywords: self-supervised
Abstract: Individuals with impaired hearing experience difficulty in conversations, especially in noisy environments. This difficulty often manifests as a change in behavior and may be captured via facial expressions, such as the expression of discomfort or fatigue. In this work, we build on this idea and introduce the problem of detecting hearing loss from an individual's facial expressions during a conversation. Building machine learning models that can represent hearing-related facial expression changes is a challenge. In addition, models need to disentangle spurious age-related correlations from hearing-driven expressions. To this end, we propose a self-supervised pre-training strategy tailored for the modeling of expression variations. We also use adversarial representation learning to mitigate the age bias. We evaluate our approach on a large-scale egocentric dataset with real-world conversational scenarios involving subjects with hearing loss and show that our method for hearing loss detection achieves superior performance over baselines.

Title: ACT-GAN: Radio map construction based on generative adversarial networks with ACT blocks

Authors: Chen Qi, Yang Jingjing, Huang Ming, Zhou Qiang
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2401.08976
Pdf URL: https://arxiv.org/pdf/2401.08976
Copy Paste: [[2401.08976]] ACT-GAN: Radio map construction based on generative adversarial networks with ACT blocks(https://arxiv.org/abs/2401.08976)
Keywords: generative
Abstract: The radio map, serving as a visual representation of electromagnetic spatial characteristics, plays a pivotal role in assessment of wireless communication networks and radio monitoring coverage. Addressing the issue of low accuracy existing in the current radio map construction, this paper presents a novel radio map construction method based on generative adversarial network (GAN) in which the Aggregated Contextual-Transformation (AOT) block, Convolutional Block Attention Module (CBAM), and Transposed Convolution (T-Conv) block are applied to the generator, and we name it as ACT-GAN. It significantly improves the reconstruction accuracy and local texture of the radio maps. The performance of ACT-GAN across three different scenarios is demonstrated. Experiment results reveal that in the scenario without sparse discrete observations, the proposed method reduces the root mean square error (RMSE) by 14.6% in comparison to the state-of-the-art models. In the scenario with sparse discrete observations, the RMSE is diminished by 13.2%. Furthermore, the predictive results of the proposed model show a more lucid representation of electromagnetic spatial field distribution. To verify the universality of this model in radio map construction tasks, the scenario of unknown radio emission source is investigated. The results indicate that the proposed model is robust radio map construction and accurate in predicting the location of the emission source.

Title: A GAN-based data poisoning framework against anomaly detection in vertical federated learning

Authors: Xiaolin Chen, Daoguang Zan, Wei Li, Bei Guan, Yongji Wang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2401.08984
Pdf URL: https://arxiv.org/pdf/2401.08984
Copy Paste: [[2401.08984]] A GAN-based data poisoning framework against anomaly detection in vertical federated learning(https://arxiv.org/abs/2401.08984)
Keywords: anomaly
Abstract: In vertical federated learning (VFL), commercial entities collaboratively train a model while preserving data privacy. However, a malicious participant's poisoning attack may degrade the performance of this collaborative model. The main challenge in achieving the poisoning attack is the absence of access to the server-side top model, leaving the malicious participant without a clear target model. To address this challenge, we introduce an innovative end-to-end poisoning framework P-GAN. Specifically, the malicious participant initially employs semi-supervised learning to train a surrogate target model. Subsequently, this participant employs a GAN-based method to produce adversarial perturbations to degrade the surrogate target model's performance. Finally, the generator is obtained and tailored for VFL poisoning. Besides, we develop an anomaly detection algorithm based on a deep auto-encoder (DAE), offering a robust defense mechanism to VFL scenarios. Through extensive experiments, we evaluate the efficacy of P-GAN and DAE, and further analyze the factors that influence their performance.

Title: Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

Authors: Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2401.08992
Pdf URL: https://arxiv.org/pdf/2401.08992
Copy Paste: [[2401.08992]] Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR(https://arxiv.org/abs/2401.08992)
Keywords: foundation model
Abstract: The end-to-end ASR model is often desired in the streaming multilingual scenario since it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, especially on tail ones. Sometimes even the data itself may become unavailable as a result of the enhanced privacy protection. Existing work tend to significantly increase the model size or learn language-specific decoders to accommodate each language separately. In this study, we explore simple yet effective Language-Dependent Adapter (LDA) finetuning under a cascaded Conformer transducer framework enhanced by teacher pseudo-labeling for tail languages in the streaming multilingual ASR. The adapter only accounts for 0.4% of the full model per language. It is plugged into the frozen foundation model and is the only trainable module during the finetuning process with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. The model performance is validated on a challenging multilingual dictation dataset, which includes 39 tail languages across Latin, Greek, Arabic, etc. Our proposed method brings 12.2% word error rate reduction on average and up to 37.5% on a single locale. Furthermore, we show that our parameter-efficient LDA can match the quality of the full model finetuning, thus greatly alleviating the asynchronous peak performance issue.

Title: Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation

Authors: Tong Xie, Haoyu Li, Andrew Bai, Cho-Jui Hsieh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.09031
Pdf URL: https://arxiv.org/pdf/2401.09031
Copy Paste: [[2401.09031]] Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation(https://arxiv.org/abs/2401.09031)
Keywords: diffusion
Abstract: Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand ``black-box'' neural networks. While prior research has established quantifiable links between model output and training data in diverse settings, interpreting diffusion model outputs in relation to training samples remains underexplored. In particular, diffusion models operate over a sequence of timesteps instead of instantaneous input-output relationships in previous contexts, posing a significant challenge to extend existing frameworks to diffusion models directly. Notably, we present Diffusion-TracIn that incorporates this temporal dynamics and observe that samples' loss gradient norms are highly dependent on timestep. This trend leads to a prominent bias in influence estimation, and is particularly noticeable for samples trained on large-norm-inducing timesteps, causing them to be generally influential. To mitigate this effect, we introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest, facilitating a localized measurement of influence and considerably more intuitive visualization. We demonstrate the efficacy of our approach through various evaluation metrics and auxiliary tasks, reducing the amount of generally influential samples to $\frac{1}{3}$ of its original quantity.

Title: VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09047
Pdf URL: https://arxiv.org/pdf/2401.09047
Copy Paste: [[2401.09047]] VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models(https://arxiv.org/abs/2401.09047)
Keywords: diffusion
Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.

Title: Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09048
Pdf URL: https://arxiv.org/pdf/2401.09048
Copy Paste: [[2401.09048]] Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis(https://arxiv.org/abs/2401.09048)
Keywords: diffusion
Abstract: Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce \textit{depth disentanglement training} to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce \textit{soft guidance}, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, \textsc{Compose and Conquer (CnC)}, unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/

Title: Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior

Authors: Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2401.09050
Pdf URL: https://arxiv.org/pdf/2401.09050
Copy Paste: [[2401.09050]] Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior(https://arxiv.org/abs/2401.09050)
Keywords: diffusion
Abstract: Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but are vulnerable to geometry collapse and poor textures yet. To solve this issue, we first deeply analyze the SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as a guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy, and thus is not a consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE, there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given a rendered image by a 3D model, we first estimate its desired 3D score function by a pre-trained 2D diffusion model, and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide another more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The codes are available at https://github.com/sail-sg/Consistent3D.

Title: CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding

Authors: Yunze Liu, Changxi Chen, Zifan Wang, Li Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09057
Pdf URL: https://arxiv.org/pdf/2401.09057
Copy Paste: [[2401.09057]] CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding(https://arxiv.org/abs/2401.09057)
Keywords: self-supervised
Abstract: This paper introduces a novel approach named CrossVideo, which aims to enhance self-supervised cross-modal contrastive learning in the field of point cloud video understanding. Traditional supervised learning methods encounter limitations due to data scarcity and challenges in label acquisition. To address these issues, we propose a self-supervised learning method that leverages the cross-modal relationship between point cloud videos and image videos to acquire meaningful feature representations. Intra-modal and cross-modal contrastive learning techniques are employed to facilitate effective comprehension of point cloud video. We also propose a multi-level contrastive approach for both modalities. Through extensive experiments, we demonstrate that our method significantly surpasses previous state-of-the-art approaches, and we conduct comprehensive ablation studies to validate the effectiveness of our proposed designs.

Title: Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models

Authors: Haonan Guo, Xin Su, Chen Wu, Bo Du, Liangpei Zhang, Deren Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09083
Pdf URL: https://arxiv.org/pdf/2401.09083
Copy Paste: [[2401.09083]] Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models(https://arxiv.org/abs/2401.09083)
Keywords: foundation model
Abstract: Recently, the flourishing large language models(LLM), especially ChatGPT, have shown exceptional performance in language understanding, reasoning, and interaction, attracting users and researchers from multiple fields and domains. Although LLMs have shown great capacity to perform human-like task accomplishment in natural language and natural image, their potential in handling remote sensing interpretation tasks has not yet been fully explored. Moreover, the lack of automation in remote sensing task planning hinders the accessibility of remote sensing interpretation techniques, especially to non-remote sensing experts from multiple research fields. To this end, we present Remote Sensing ChatGPT, an LLM-powered agent that utilizes ChatGPT to connect various AI-based remote sensing models to solve complicated interpretation tasks. More specifically, given a user request and a remote sensing image, we utilized ChatGPT to understand user requests, perform task planning according to the tasks' functions, execute each subtask iteratively, and generate the final response according to the output of each subtask. Considering that LLM is trained with natural language and is not capable of directly perceiving visual concepts as contained in remote sensing images, we designed visual cues that inject visual information into ChatGPT. With Remote Sensing ChatGPT, users can simply send a remote sensing image with the corresponding request, and get the interpretation results as well as language feedback from Remote Sensing ChatGPT. Experiments and examples show that Remote Sensing ChatGPT can tackle a wide range of remote sensing tasks and can be extended to more tasks with more sophisticated models such as the remote sensing foundation model. The code and demo of Remote Sensing ChatGPT is publicly available at https://github.com/HaonanGuo/Remote-Sensing-ChatGPT .

Title: UniVG: Towards UNIfied-modal Video Generation

Authors: Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09084
Pdf URL: https://arxiv.org/pdf/2401.09084
Copy Paste: [[2401.09084]] UniVG: Towards UNIfied-modal Video Generation(https://arxiv.org/abs/2401.09084)
Keywords: diffusion, generative
Abstract: Diffusion based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Genearation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fr\'echet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current close-source method Gen2. For more samples, visit https://univg-baidu.github.io.

Title: Machine Learning for Healthcare-IoT Security: A Review and Risk Mitigation

Authors: Mirza Akhi Khatun, Sanober Farheen Memon, Ciarán Eising, Lubna Luxmi Dhirani
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2401.09124
Pdf URL: https://arxiv.org/pdf/2401.09124
Copy Paste: [[2401.09124]] Machine Learning for Healthcare-IoT Security: A Review and Risk Mitigation(https://arxiv.org/abs/2401.09124)
Keywords: generative
Abstract: The Healthcare Internet-of-Things (H-IoT), commonly known as Digital Healthcare, is a data-driven infrastructure that highly relies on smart sensing devices (i.e., blood pressure monitors, temperature sensors, etc.) for faster response time, treatments, and diagnosis. However, with the evolving cyber threat landscape, IoT devices have become more vulnerable to the broader risk surface (e.g., risks associated with generative AI, 5G-IoT, etc.), which, if exploited, may lead to data breaches, unauthorized access, and lack of command and control and potential harm. This paper reviews the fundamentals of healthcare IoT, its privacy, and data security challenges associated with machine learning and H-IoT devices. The paper further emphasizes the importance of monitoring healthcare IoT layers such as perception, network, cloud, and application. Detecting and responding to anomalies involves various cyber-attacks and protocols such as Wi-Fi 6, Narrowband Internet of Things (NB-IoT), Bluetooth, ZigBee, LoRa, and 5G New Radio (5G NR). A robust authentication mechanism based on machine learning and deep learning techniques is required to protect and mitigate H-IoT devices from increasing cybersecurity vulnerabilities. Hence, in this review paper, security and privacy challenges and risk mitigation strategies for building resilience in H-IoT are explored and reported.

Title: SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects

Authors: Haowen Wang, Zhen Zhao, Zhao Jin, Zhengping Che, Liang Qiao, Yakun Huang, Zhipeng Fan, Xiuquan Qiao, Jian Tang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2401.09133
Pdf URL: https://arxiv.org/pdf/2401.09133
Copy Paste: [[2401.09133]] SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects(https://arxiv.org/abs/2401.09133)
Keywords: self-supervised
Abstract: Reconstructing real-world objects and estimating their movable joint structures are pivotal technologies within the field of robotics. Previous research has predominantly focused on supervised approaches, relying on extensively annotated datasets to model articulated objects within limited categories. However, this approach falls short of effectively addressing the diversity present in the real world. To tackle this issue, we propose a self-supervised interaction perception method, referred to as SM$^3$, which leverages multi-view RGB images captured before and after interaction to model articulated objects, identify the movable parts, and infer the parameters of their rotating joints. By constructing 3D geometries and textures from the captured 2D images, SM$^3$ achieves integrated optimization of movable part and joint parameters during the reconstruction process, obviating the need for annotations. Furthermore, we introduce the MMArt dataset, an extension of PartNet-Mobility, encompassing multi-view and multi-modal data of articulated objects spanning diverse categories. Evaluations demonstrate that SM$^3$ surpasses existing benchmarks across various categories and objects, while its adaptability in real-world scenarios has been thoroughly validated.

Title: Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder

Authors: Almudévar Antonio, Mariotte Théo, Ortega Alfonso, Tahon Marie
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2401.09180
Pdf URL: https://arxiv.org/pdf/2401.09180
Copy Paste: [[2401.09180]] Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder(https://arxiv.org/abs/2401.09180)
Keywords: generative
Abstract: Unsupervised Multiple Domain Translation is the task of transforming data from one domain to other domains without having paired data to train the systems. Typically, methods based on Generative Adversarial Networks (GANs) are used to address this task. However, our proposal exclusively relies on a modified version of a Variational Autoencoder. This modification consists of the use of two latent variables disentangled in a controlled way by design. One of this latent variables is imposed to depend exclusively on the domain, while the other one must depend on the rest of the variability factors of the data. Additionally, the conditions imposed over the domain latent variable allow for better control and understanding of the latent space. We empirically demonstrate that our approach works on different vision datasets improving the performance of other well known methods. Finally, we prove that, indeed, one of the latent variables stores all the information related to the domain and the other one hardly contains any domain information.

Title: Training-Free Semantic Video Composition via Pre-trained Diffusion Model

Authors: Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09195
Pdf URL: https://arxiv.org/pdf/2401.09195
Copy Paste: [[2401.09195]] Training-Free Semantic Video Composition via Pre-trained Diffusion Model(https://arxiv.org/abs/2401.09195)
Keywords: diffusion
Abstract: The video composition task aims to integrate specified foregrounds and backgrounds from different videos into a harmonious composite. Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments, such as domain gaps. Therefore, we propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge, which can process composite videos with broader semantic disparities. Specifically, we process the video frames in a cascading manner and handle each frame in two processes with the diffusion model. In the inversion process, we propose Balanced Partial Inversion to obtain generation initial points that balance reversibility and modifiability. Then, in the generation process, we further propose Inter-Frame Augmented attention to augment foreground continuity across frames. Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs, demonstrating efficacy in managing broader semantic disparities.

Title: Siamese Meets Diffusion Network: SMDNet for Enhanced Change Detection in High-Resolution RS Imagery

Authors: Jia Jia, Geunho Lee, Zhibo Wang, Lyu Zhi, Yuchu He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09325
Pdf URL: https://arxiv.org/pdf/2401.09325
Copy Paste: [[2401.09325]] Siamese Meets Diffusion Network: SMDNet for Enhanced Change Detection in High-Resolution RS Imagery(https://arxiv.org/abs/2401.09325)
Keywords: diffusion
Abstract: Recently, the application of deep learning to change detection (CD) has significantly progressed in remote sensing images. In recent years, CD tasks have mostly used architectures such as CNN and Transformer to identify these changes. However, these architectures have shortcomings in representing boundary details and are prone to false alarms and missed detections under complex lighting and weather conditions. For that, we propose a new network, Siamese Meets Diffusion Network (SMDNet). This network combines the Siam-U2Net Feature Differential Encoder (SU-FDE) and the denoising diffusion implicit model to improve the accuracy of image edge change detection and enhance the model's robustness under environmental changes. First, we propose an innovative SU-FDE module that utilizes shared weight features to capture differences between time series images and identify similarities between features to enhance edge detail detection. Furthermore, we add an attention mechanism to identify key coarse features to improve the model's sensitivity and accuracy. Finally, the diffusion model of progressive sampling is used to fuse key coarse features, and the noise reduction ability of the diffusion model and the advantages of capturing the probability distribution of image data are used to enhance the adaptability of the model in different environments. Our method's combination of feature extraction and diffusion models demonstrates effectiveness in change detection in remote sensing images. The performance evaluation of SMDNet on LEVIR-CD, DSIFN-CD, and CDD datasets yields validated F1 scores of 90.99%, 88.40%, and 88.47%, respectively. This substantiates the advanced capabilities of our model in accurately identifying variations and intricate details.

Title: POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Authors: Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2401.09413
Pdf URL: https://arxiv.org/pdf/2401.09413
Copy Paste: [[2401.09413]] POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images(https://arxiv.org/abs/2401.09413)
Keywords: self-supervised
Abstract: We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: Zero-shot 3D semantic segmentation using existing datasets; 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes. You can find the project page here https://vobecant.github.io/POP3D.

Title: Vlogger: Make Your Dream A Vlog

Authors: Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2401.09414
Pdf URL: https://arxiv.org/pdf/2401.09414
Copy Paste: [[2401.09414]] Vlogger: Make Your Dream A Vlog(https://arxiv.org/abs/2401.09414)
Keywords: diffusion, foundation model
Abstract: In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) of user descriptions. Different from short videos with a few seconds, vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages Large Language Model (LLM) as Director and decomposes a long video generation task of vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. Besides, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, the extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor. The code and model is all available at https://github.com/zhuangshaobin/Vlogger.

Title: TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion

Authors: Yu-Ying Yeh, Jia-Bin Huang, Changil Kim, Lei Xiao, Thu Nguyen-Phuoc, Numair Khan, Cheng Zhang, Manmohan Chandraker, Carl S Marshall, Zhao Dong, Zhengqin Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2401.09416
Pdf URL: https://arxiv.org/pdf/2401.09416
Copy Paste: [[2401.09416]] TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion(https://arxiv.org/abs/2401.09416)
Keywords: diffusion
Abstract: We present TextureDreamer, a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry, while learning-based methods are confined to category-specific shapes within the dataset. In contrast, TextureDreamer can transfer highly detailed, intricate textures from real-world environments to arbitrary objects with only a few casually captured images, potentially significantly democratizing texture creation. Our core idea, personalized geometry-aware score distillation (PGSD), draws inspiration from recent advancements in diffuse models, including personalized modeling for texture information extraction, variational score distillation for detailed appearance synthesis, and explicit geometry guidance with ControlNet. Our integration and several essential modifications substantially improve the texture quality. Experiments on real images spanning different categories show that TextureDreamer can successfully transfer highly realistic, semantic meaningful texture to arbitrary objects, surpassing the visual quality of previous state-of-the-art.

Title: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2401.09417
Pdf URL: https://arxiv.org/pdf/2401.09417
Copy Paste: [[2401.09417]] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model(https://arxiv.org/abs/2401.09417)
Keywords: foundation model
Abstract: Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.