2024-08-22

Title: Tabular Transfer Learning via Prompting LLMs

Authors: Jaehyun Nam, Woomin Song, Seong Hyeon Park, Jihoon Tack, Sukmin Yun, Jaehyung Kim, Kyu Hwan Oh, Jinwoo Shin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11063
Pdf URL: https://arxiv.org/pdf/2408.11063
Copy Paste: [[2408.11063]] Tabular Transfer Learning via Prompting LLMs(https://arxiv.org/abs/2408.11063)
Keywords: in-context
Abstract: Learning with a limited number of labeled data is a central problem in real-world applications of machine learning, as it is often expensive to obtain annotations. To deal with the scarcity of labeled data, transfer learning is a conventional approach; it aims to learn transferable knowledge by training a neural network on multiple other sources. In this paper, we investigate transfer learning of tabular tasks, which has been less studied and less successful in the literature than in other domains, e.g., vision and language. This is because tables are inherently heterogeneous, i.e., they contain different columns and feature spaces, making transfer learning difficult. On the other hand, recent advances in natural language processing suggest that the label scarcity issue can be mitigated by utilizing the in-context learning capability of large language models (LLMs). Inspired by this and the fact that LLMs can also process tables within a unified language space, we ask whether LLMs can be effective for tabular transfer learning, in particular, in scenarios where the source and target datasets have different formats. As a positive answer, we propose a novel tabular transfer learning framework, coined Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with LLMs. Specifically, P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts. Experimental results demonstrate that P2T outperforms previous methods on various tabular learning benchmarks, showing good promise for the important, yet underexplored tabular transfer learning problem. Code is available at this https URL.

Title: DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Authors: Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Qi Guo, Kaidi Xu
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2408.11071
Pdf URL: https://arxiv.org/pdf/2408.11071
Copy Paste: [[2408.11071]] DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization(https://arxiv.org/abs/2408.11071)
Keywords: diffusion, generative
Abstract: Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model's capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2I models.
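
Note: The abstract above mentions zeroth-order optimization as a way to approximate gradients from queries alone. As a rough illustration of that general idea (not the DiffZOO algorithm itself, and with no modeling of C-PRV or D-PRV), the sketch below uses a standard two-point random-direction estimator on a toy continuous function. The names `zo_gradient` and `black_box_loss`, the step size, and the number of probe directions are all hypothetical choices made for this example.

```python
import random

def zo_gradient(f, x, rng, num_dirs=20, mu=1e-3):
    """Estimate the gradient of f at x using function values only."""
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        # Draw a random Gaussian probe direction.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        f_plus = f([xi + mu * ui for xi, ui in zip(x, u)])
        f_minus = f([xi - mu * ui for xi, ui in zip(x, u)])
        # Central finite difference along u approximates the directional derivative.
        slope = (f_plus - f_minus) / (2.0 * mu)
        # Average slope * u over all probe directions.
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return grad

def black_box_loss(x):
    """Toy stand-in for a query-only objective (assumption, not a T2I model)."""
    return sum((xi - 1.0) ** 2 for xi in x)

rng = random.Random(0)
x = [0.0, 0.0, 0.0]
initial_loss = black_box_loss(x)
for _ in range(100):
    g = zo_gradient(black_box_loss, x, rng)
    # Plain gradient descent using the estimated gradient.
    x = [xi - 0.1 * gi for xi, gi in zip(x, g)]
final_loss = black_box_loss(x)
```

The loop never touches an analytic gradient: each update costs `2 * num_dirs` queries to the objective, which is the trade-off any purely query-based method has to manage.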

Title: GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Authors: Changkun Liu, Shuai Chen, Yash Bhalgat, Siyan Hu, Zirui Wang, Ming Cheng, Victor Adrian Prisacariu, Tristan Braud
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11085
Pdf URL: https://arxiv.org/pdf/2408.11085
Copy Paste: [[2408.11085]] GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting(https://arxiv.org/abs/2408.11085)
Keywords: foundation model
Abstract: We leverage 3D Gaussian Splatting (3DGS) as a scene representation and propose a novel test-time camera pose refinement framework, GSLoc. This framework enhances the localization accuracy of state-of-the-art absolute pose regression and scene coordinate regression methods. The 3DGS model renders high-quality synthetic images and depth maps to facilitate the establishment of 2D-3D correspondences. GSLoc obviates the need for training feature extractors or descriptors by operating directly on RGB images, utilizing the 3D vision foundation model, MASt3R, for precise 2D matching. To improve the robustness of our model in challenging outdoor environments, we incorporate an exposure-adaptive module within the 3DGS framework. Consequently, GSLoc enables efficient pose refinement given a single RGB query and a coarse initial pose estimation. Our proposed approach surpasses leading NeRF-based optimization methods in both accuracy and runtime across indoor and outdoor visual localization benchmarks, achieving state-of-the-art accuracy on two indoor datasets.

Title: MS$^3$D: A RG Flow-Based Regularization for GAN Training with Limited Data

Authors: Jian Wang, Xin Lan, Yuxin Tian, Jiancheng Lv
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11135
Pdf URL: https://arxiv.org/pdf/2408.11135
Copy Paste: [[2408.11135]] MS$^3$D: A RG Flow-Based Regularization for GAN Training with Limited Data(https://arxiv.org/abs/2408.11135)
Keywords: generative
Abstract: Generative adversarial networks (GANs) have made impressive advances in image generation, but they often require large-scale training data to avoid degradation caused by discriminator overfitting. To tackle this issue, we investigate the challenge of training GANs with limited data, and propose a novel regularization method based on the idea of renormalization group (RG) in physics. We observe that in the limited data setting, the gradient pattern that the generator obtains from the discriminator becomes more aggregated over time. In RG context, this aggregated pattern exhibits a high discrepancy from its coarse-grained versions, which implies a high-capacity and sensitive system, prone to overfitting and collapse. To address this problem, we introduce a multi-scale structural self-dissimilarity (MS$^3$D) regularization, which constrains the gradient field to have a consistent pattern across different scales, thereby fostering a more redundant and robust system. We show that our method can effectively enhance the performance and stability of GANs under limited data scenarios, and even allow them to generate high-quality images with very few data.

Title: Total Uncertainty Quantification in Inverse PDE Solutions Obtained with Reduced-Order Deep Learning Surrogate Models

Authors: Yuanzhe Wang, Alexandre M. Tartakovsky
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.11145
Pdf URL: https://arxiv.org/pdf/2408.11145
Copy Paste: [[2408.11145]] Total Uncertainty Quantification in Inverse PDE Solutions Obtained with Reduced-Order Deep Learning Surrogate Models(https://arxiv.org/abs/2408.11145)
Keywords: diffusion
Abstract: We propose an approximate Bayesian method for quantifying the total uncertainty in inverse PDE solutions obtained with machine learning surrogate models, including operator learning models. The proposed method accounts for uncertainty in the observations and PDE and surrogate models. First, we use the surrogate model to formulate a minimization problem in the reduced space for the maximum a posteriori (MAP) inverse solution. Then, we randomize the MAP objective function and obtain samples of the posterior distribution by minimizing different realizations of the objective function. We test the proposed framework by comparing it with the iterative ensemble smoother and deep ensembling methods for a non-linear diffusion equation with an unknown space-dependent diffusion coefficient. Among other problems, this equation describes groundwater flow in an unconfined aquifer. Depending on the training dataset and ensemble sizes, the proposed method provides similar or more descriptive posteriors of the parameters and states than the iterative ensemble smoother method. Deep ensembling underestimates uncertainty and provides less informative posteriors than the other two methods.
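
Note: The abstract above describes randomizing the MAP objective and minimizing each realization to obtain posterior samples. A minimal sketch of that idea follows, under strong assumptions that are not taken from the paper: a scalar linear model `y = a * theta + noise` with Gaussian prior and noise, and randomization by perturbing the observation and the prior mean. In this toy case each minimizer has a closed form and the samples follow the exact posterior; the paper's setting (nonlinear PDE, surrogate models, reduced space) is far more involved. All names and numbers are hypothetical.

```python
import random
from statistics import mean, variance

def randomized_map_samples(y_obs, a, sigma_obs, prior_mean, sigma_prior, n, seed=0):
    """Draw n samples by minimizing n randomized copies of a scalar MAP objective."""
    rng = random.Random(seed)
    precision = a**2 / sigma_obs**2 + 1.0 / sigma_prior**2
    samples = []
    for _ in range(n):
        # Randomize the objective: perturb the observation and the prior mean.
        y_pert = y_obs + rng.gauss(0.0, sigma_obs)
        m_pert = prior_mean + rng.gauss(0.0, sigma_prior)
        # Closed-form minimizer of
        # (y_pert - a*theta)^2 / sigma_obs^2 + (theta - m_pert)^2 / sigma_prior^2.
        theta = (a * y_pert / sigma_obs**2 + m_pert / sigma_prior**2) / precision
        samples.append(theta)
    return samples

# Toy problem (illustrative values only).
a, sigma_obs, prior_mean, sigma_prior, y_obs = 2.0, 0.5, 0.0, 1.0, 3.0
samples = randomized_map_samples(y_obs, a, sigma_obs, prior_mean, sigma_prior, n=20000)

# Analytic Gaussian posterior for comparison.
precision = a**2 / sigma_obs**2 + 1.0 / sigma_prior**2
exact_mean = (a * y_obs / sigma_obs**2 + prior_mean / sigma_prior**2) / precision
exact_var = 1.0 / precision
sample_mean, sample_var = mean(samples), variance(samples)
```

For nonlinear forward models the minimization would need an iterative optimizer, and the resulting samples are only approximate, which is consistent with the abstract calling the method an approximate Bayesian one.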

Title: Compress Guidance in Conditional Diffusion Sampling

Authors: Anh-Dung Dinh, Daochang Liu, Chang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11194
Pdf URL: https://arxiv.org/pdf/2408.11194
Copy Paste: [[2408.11194]] Compress Guidance in Conditional Diffusion Sampling(https://arxiv.org/abs/2408.11194)
Keywords: diffusion, generative
Abstract: Enforcing guidance throughout the entire sampling process often proves counterproductive due to the model-fitting issue, where samples are generated to match the classifier's parameters rather than generalizing the expected condition. This work identifies and quantifies the problem, demonstrating that reducing or excluding guidance at numerous timesteps can mitigate this issue. By distributing the guidance densely in the early stages of the process, we observe a significant improvement in image quality and diversity while also reducing the required guidance timesteps by nearly 40%. This approach addresses a major challenge in applying guidance effectively to generative tasks. Consequently, our proposed method, termed Compress Guidance, allows for the exclusion of a substantial number of guidance timesteps while still surpassing baseline models in image quality. We validate our approach through benchmarks on label conditional and text-to-image generative tasks across various datasets and models.

Title: UKAN: Unbound Kolmogorov-Arnold Network Accompanied with Accelerated Library

Authors: Alireza Moradzadeh, Lukasz Wawrzyniak, Miles Macklin, Saee G. Paliwal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.11200
Pdf URL: https://arxiv.org/pdf/2408.11200
Copy Paste: [[2408.11200]] UKAN: Unbound Kolmogorov-Arnold Network Accompanied with Accelerated Library(https://arxiv.org/abs/2408.11200)
Keywords: generative
Abstract: In this work, we present a GPU-accelerated library for the underlying components of Kolmogorov-Arnold Networks (KANs), along with an algorithm to eliminate bounded grids in KANs. The GPU-accelerated library reduces the computational complexity of Basis Spline (B-spline) evaluation by a factor of $\mathcal{O}$(grid size) compared to existing codes, enabling batch computation for large-scale learning. To overcome the limitations of traditional KANs, we introduce Unbounded KANs (UKANs), which eliminate the need for a bounded grid and a fixed number of B-spline coefficients. To do so, we replace the KAN parameters (B-spline coefficients) with a coefficient generator (CG) model. The inputs to the CG model are designed based on the idea of an infinite symmetric grid extending from negative infinity to positive infinity. The positional encoding of a grid group, a sequential collection of B-spline grid indexes, is fed into the CG model, and the coefficients are consumed by the efficient implementation (matrix representations) of B-spline functions to generate outputs. We perform several experiments on regression, classification, and generative tasks, with promising results. In particular, UKAN does not require data normalization or a bounded domain for evaluation. Additionally, our benchmarking results indicate the superior memory and computational efficiency of our library compared to existing codes.

Title: PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Authors: Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, Mengye Ren
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11208
Pdf URL: https://arxiv.org/pdf/2408.11208
Copy Paste: [[2408.11208]] PooDLe: Pooled and dense self-supervised learning from naturalistic videos(https://arxiv.org/abs/2408.11208)
Keywords: self-supervised
Abstract: Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.

Title: On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes

Authors: Sadia Ilyas, Ido Freeman, Matthias Rottmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11221
Pdf URL: https://arxiv.org/pdf/2408.11221
Copy Paste: [[2408.11221]] On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes(https://arxiv.org/abs/2408.11221)
Keywords: anomaly
Abstract: Out-of-distribution (OOD) object detection is a critical task focused on detecting objects that originate from a data distribution different from that of the training data. In this study, we investigate to what extent state-of-the-art open-vocabulary object detectors can detect unusual objects in street scenes, which are considered as OOD or rare scenarios with respect to common street scene datasets. Specifically, we evaluate their performance on the OoDIS Benchmark, which extends RoadAnomaly21 and RoadObstacle21 from SegmentMeIfYouCan, as well as LostAndFound, which was recently extended to object level annotations. The objective of our study is to uncover shortcomings of contemporary object detectors in challenging real-world, and particularly open-world, scenarios. Our experiments reveal that open-vocabulary models are promising for OOD object detection scenarios but far from perfect. Substantial improvements are required before they can be reliably deployed in real-world applications. We benchmark four state-of-the-art open-vocabulary object detection models on three different datasets. Notably, Grounding DINO achieves the best results on RoadObstacle21 and LostAndFound in our study, with APs of 48.3% and 25.4%, respectively. YOLO-World excels on RoadAnomaly21 with an AP of 21.2%.

Title: CooPre: Cooperative Pretraining for V2X Cooperative Perception

Authors: Seth Z. Zhao, Hao Xiang, Chenfeng Xu, Xin Xia, Bolei Zhou, Jiaqi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11241
Pdf URL: https://arxiv.org/pdf/2408.11241
Copy Paste: [[2408.11241]] CooPre: Cooperative Pretraining for V2X Cooperative Perception(https://arxiv.org/abs/2408.11241)
Keywords: self-supervised
Abstract: Existing Vehicle-to-Everything (V2X) cooperative perception methods rely on accurate multi-agent 3D annotations. Nevertheless, it is time-consuming and expensive to collect and annotate real-world data, especially for V2X systems. In this paper, we present a self-supervised learning method for V2X cooperative perception, which utilizes the vast amount of unlabeled 3D V2X data to enhance the perception performance. Beyond simply extending the previous pre-training methods for point-cloud representation learning, we introduce a novel self-supervised Cooperative Pretraining framework (termed CooPre) customized for a collaborative scenario. We point out that cooperative point-cloud sensing compensates for information loss among agents. This motivates us to design a novel proxy task for the 3D encoder to reconstruct LiDAR point clouds across different agents. Besides, we develop a V2X bird's-eye-view (BEV) guided masking strategy which effectively allows the model to pay attention to 3D features across heterogeneous V2X agents (i.e., vehicles and infrastructure) in the BEV space. Notably, such a masking strategy effectively pretrains the 3D encoder and is compatible with mainstream cooperative perception backbones. Our approach, validated through extensive experiments on representative datasets (i.e., V2X-Real, V2V4Real, and OPV2V), leads to a performance boost across all V2X settings. Additionally, we demonstrate the framework's improvements in cross-domain transferability, data efficiency, and robustness under challenging scenarios. The code will be made publicly available.

Title: Do Neural Scaling Laws Exist on Graph Self-Supervised Learning?

Authors: Qian Ma, Haitao Mao, Jingzhe Liu, Zhehua Zhang, Chunlin Feng, Yu Song, Yihan Shao, Tianfan Fu, Yao Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11243
Pdf URL: https://arxiv.org/pdf/2408.11243
Copy Paste: [[2408.11243]] Do Neural Scaling Laws Exist on Graph Self-Supervised Learning?(https://arxiv.org/abs/2408.11243)
Keywords: self-supervised, foundation model
Abstract: Self-supervised learning (SSL) is essential to obtain foundation models in NLP and CV domains via effectively leveraging knowledge in large-scale unlabeled data. The reason for its success is that a suitable SSL design can help the model to follow the neural scaling law, i.e., the performance consistently improves with increasing model and dataset sizes. However, it remains a mystery whether existing SSL in the graph domain can follow the scaling behavior toward building Graph Foundation Models (GFMs) with large-scale pre-training. In this study, we examine whether existing graph SSL techniques can follow the neural scaling behavior with the potential to serve as the essential component for GFMs. Our benchmark includes comprehensive SSL technique implementations with analysis conducted on both the conventional SSL setting and many new settings adopted in other domains. Surprisingly, despite the SSL loss continuously decreasing, no existing graph SSL techniques follow the neural scaling behavior on the downstream performance. The model performance merely fluctuates across different data scales and model scales. Instead of the scales, the key factors influencing the performance are the choices of model architecture and pretext task design. This paper examines existing SSL techniques for the feasibility of Graph SSL techniques in developing GFMs and opens a new direction for graph SSL design with the new evaluation prototype. Our code implementation is available online to ease reproducibility at this https URL.

Title: Taming Generative Diffusion for Universal Blind Image Restoration

Authors: Siwei Tu, Weidong Yang, Ben Fei
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11287
Pdf URL: https://arxiv.org/pdf/2408.11287
Copy Paste: [[2408.11287]] Taming Generative Diffusion for Universal Blind Image Restoration(https://arxiv.org/abs/2408.11287)
Keywords: diffusion, generative
Abstract: Diffusion models have been widely utilized for image restoration. However, previous blind image restoration methods still need to assume the type of degradation model while leaving the parameters to be optimized, limiting their real-world applications. Therefore, we aim to tame generative diffusion prior for universal blind image restoration dubbed BIR-D, which utilizes an optimizable convolutional kernel to simulate the degradation model and dynamically update the parameters of the kernel in the diffusion steps, enabling it to achieve blind image restoration results even in various complex situations. Besides, based on mathematical reasoning, we have provided an empirical formula for choosing the adaptive guidance scale, eliminating the need for a grid search for the optimal parameter. Experimentally, our BIR-D has demonstrated superior practicality and versatility over off-the-shelf unsupervised methods across various tasks on both real-world and synthetic datasets, qualitatively and quantitatively. BIR-D is able to fulfill multi-guidance blind image restoration. Moreover, BIR-D can also restore images that undergo multiple and complicated degradations, demonstrating its practical applications.

Title: UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Authors: Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11305
Pdf URL: https://arxiv.org/pdf/2408.11305
Copy Paste: [[2408.11305]] UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation(https://arxiv.org/abs/2408.11305)
Keywords: diffusion, generative
Abstract: The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. The rapid advancements in artificial intelligence generated content, particularly in technologies like large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models in the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. Moreover, current research on multi-task single models lacks focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at this https URL.

Title: TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Authors: Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11318
Pdf URL: https://arxiv.org/pdf/2408.11318
Copy Paste: [[2408.11318]] TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models(https://arxiv.org/abs/2408.11318)
Keywords: self-supervised, foundation model
Abstract: In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, pretrained only on publicly accessible datasets, our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement compared to DFN (ViT-H), a 2.7%p improvement compared to V-JEPA (ViT-H) and a 2.8%p improvement compared to InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available on this https URL.

Title: HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model

Authors: Yi Wang, Jian Ma, Ruizhi Shao, Qiao Feng, Yu-kun Lai, Kun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11357
Pdf URL: https://arxiv.org/pdf/2408.11357
Copy Paste: [[2408.11357]] HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model(https://arxiv.org/abs/2408.11357)
Keywords: diffusion
Abstract: This paper aims to generate physically-layered 3D humans from text prompts. Existing methods either generate 3D clothed humans as a whole or support only tight and simple clothing generation, which limits their applications to virtual try-on and part-level editing. To achieve physically-layered 3D human generation with reusable and complex clothing, we propose a novel layer-wise dressed human representation based on a physically-decoupled diffusion model. Specifically, to achieve layer-wise clothing generation, we propose a dual-representation decoupling framework for generating clothing decoupled from the human body, in conjunction with an innovative multi-layer fusion volume rendering method. To match the clothing with different body shapes, we propose an SMPL-driven implicit field deformation network that enables the free transfer and reuse of clothing. Extensive experiments demonstrate that our approach not only achieves state-of-the-art layered 3D human generation with complex clothing but also supports virtual try-on and layered human animation.

Title: Hypergraph Learning based Recommender System for Anomaly Detection, Control and Optimization

Authors: Sakhinana Sagar Srinivas, Rajat Kumar Sarkar, Venkataramana Runkana
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11359
Pdf URL: https://arxiv.org/pdf/2408.11359
Copy Paste: [[2408.11359]] Hypergraph Learning based Recommender System for Anomaly Detection, Control and Optimization(https://arxiv.org/abs/2408.11359)
Keywords: self-supervised, anomaly
Abstract: Anomaly detection is a fundamental yet challenging problem with practical applications in industry. The current approaches neglect the higher-order dependencies within the networks of interconnected sensors in the high-dimensional time series (multisensor data) for anomaly detection. To this end, we present a self-adapting anomaly detection framework for joint learning of (a) discrete hypergraph structure and (b) modeling the temporal trends and spatial relations among the interdependent sensors using the hierarchical encoder-decoder architecture to overcome the challenges. The hypergraph representation learning-based framework exploits the relational inductive biases in the hypergraph-structured data to learn the pointwise single-step-ahead forecasts through the self-supervised autoregressive task and predicts the anomalies based on the forecast error. Furthermore, our framework incentivizes learning the anomaly-diagnosis ontology through a differentiable approach. It derives the anomaly information propagation-based computational hypergraphs for root cause analysis and provides recommendations through an offline, optimal predictive control policy to remedy an anomaly. We conduct extensive experiments to evaluate the proposed method on the benchmark datasets for fair and rigorous comparison with the popular baselines. The proposed method outperforms the baseline models and achieves SOTA performance. We report the ablation studies to support the efficacy of the framework.
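
Note: The abstract above says anomalies are predicted from the error of single-step-ahead forecasts. The sketch below illustrates only that last step in its simplest generic form, assuming a forecast is already available from some model. The thresholding rule (mean plus `k` standard deviations of errors on anomaly-free calibration data), the function name `anomaly_flags`, and all numbers are assumptions for illustration, not the paper's scoring method.

```python
from statistics import mean, stdev

def anomaly_flags(actual, forecast, calib_errors, k=3.0):
    """Flag time steps whose absolute forecast error exceeds a calibrated threshold."""
    # Threshold derived from forecast errors observed on anomaly-free data.
    threshold = mean(calib_errors) + k * stdev(calib_errors)
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    return [e > threshold for e in errors], threshold

# Hypothetical sensor readings and one-step-ahead forecasts.
calib_errors = [0.10, 0.12, 0.09, 0.11, 0.10]
actual = [1.0, 1.1, 5.0, 0.9]
forecast = [1.05, 1.0, 1.0, 1.0]
flags, threshold = anomaly_flags(actual, forecast, calib_errors)
```

Only the third reading deviates strongly from its forecast, so it is the only one flagged; the other errors stay below the calibrated threshold.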

Title: Video Diffusion Models are Strong Video Inpainter

Authors: Minhyeok Lee, Suhwan Cho, Chajin Shin, Jungho Lee, Sunghun Yang, Sangyoun Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11402
Pdf URL: https://arxiv.org/pdf/2408.11402
Copy Paste: [[2408.11402]] Video Diffusion Models are Strong Video Inpainter(https://arxiv.org/abs/2408.11402)
Keywords: diffusion
Abstract: Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.

Title: Latent Feature and Attention Dual Erasure Attack against Multi-View Diffusion Models for 3D Assets Protection

Authors: Jingwei Sun, Xuchong Zhang, Changfeng Sun, Qicheng Bai, Hongbin Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11408
Pdf URL: https://arxiv.org/pdf/2408.11408
Copy Paste: [[2408.11408]] Latent Feature and Attention Dual Erasure Attack against Multi-View Diffusion Models for 3D Assets Protection(https://arxiv.org/abs/2408.11408)
Keywords: diffusion
Abstract: Multi-View Diffusion Models (MVDMs) enable remarkable improvements in the field of 3D geometric reconstruction, but the issue regarding intellectual property has received increasing attention due to unauthorized imitation. Recently, some works have utilized adversarial attacks to protect copyright. However, all these works focus on single-image generation tasks which only need to consider the inner feature of images. Previous methods are inefficient in attacking MVDMs because they lack the consideration of disrupting the geometric and visual consistency among the generated multi-view images. This paper is the first to address the intellectual property infringement issue arising from MVDMs. Accordingly, we propose a novel latent feature and attention dual erasure attack to disrupt the distribution of latent feature and the consistency across the generated images from multi-view and multi-domain simultaneously. The experiments conducted on SOTA MVDMs indicate that our approach achieves superior performances in terms of attack effectiveness, transferability, and robustness against defense methods. Therefore, this paper provides an efficient solution to protect 3D assets from MVDMs-based 3D geometry reconstruction.

Title: SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Authors: Wei Shang, Dongwei Ren, Wanying Zhang, Qilong Wang, Pengfei Zhu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11411
Pdf URL: https://arxiv.org/pdf/2408.11411
Copy Paste: [[2408.11411]] SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction(https://arxiv.org/abs/2408.11411)
Keywords: self-supervised
Abstract: Modern consumer cameras commonly employ the rolling shutter (RS) imaging mechanism, via which images are captured by scanning scenes row-by-row, resulting in RS distortion for dynamic scenes. To correct RS distortion, existing methods adopt a fully supervised learning manner that requires high framerate global shutter (GS) images as ground-truth for supervision. In this paper, we propose an enhanced Self-supervised learning framework for Dual reversed RS distortion Correction (SelfDRSC++). Firstly, we introduce a lightweight DRSC network that incorporates a bidirectional correlation matching block to refine the joint optimization of optical flows and corrected RS features, thereby improving correction performance while reducing network parameters. Subsequently, to effectively train the DRSC network, we propose a self-supervised learning strategy that ensures cycle consistency between input and reconstructed dual reversed RS images. The RS reconstruction in SelfDRSC++ can be interestingly formulated as a specialized instance of video frame interpolation, where each row in reconstructed RS images is interpolated from predicted GS images by utilizing RS distortion time maps. By achieving superior performance while simplifying the training process, SelfDRSC++ enables feasible one-stage self-supervised training. Additionally, besides start and end RS scanning time, SelfDRSC++ allows supervision of GS images at arbitrary intermediate scanning times, thus enabling the learned DRSC network to generate high framerate GS videos. The code and trained models are available at this https URL.

Title: Pano2Room: Novel View Synthesis from a Single Indoor Panorama

Authors: Guo Pu, Yiming Zhao, Zhouhui Lian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2408.11413
Pdf URL: https://arxiv.org/pdf/2408.11413
Copy Paste: [[2408.11413]] Pano2Room: Novel View Synthesis from a Single Indoor Panorama(https://arxiv.org/abs/2408.11413)
Keywords: generative
Abstract: Recent single-view 3D generative methods have made significant advancements by leveraging knowledge distilled from extensive 3D object datasets. However, challenges persist in the synthesis of 3D scenes from a single view, primarily due to the complexity of real-world environments and the limited availability of high-quality prior resources. In this paper, we introduce a novel approach called Pano2Room, designed to automatically reconstruct high-quality 3D indoor scenes from a single panoramic image. These panoramic images can be easily generated using a panoramic RGBD inpainter from captures at a single location with any camera. The key idea is to initially construct a preliminary mesh from the input panorama, and iteratively refine this mesh using a panoramic RGBD inpainter while collecting photo-realistic 3D-consistent pseudo novel views. Finally, the refined mesh is converted into a 3D Gaussian Splatting field and trained with the collected pseudo novel views. This pipeline enables the reconstruction of real-world 3D scenes, even in the presence of large occlusions, and facilitates the synthesis of photo-realistic novel views with detailed geometry. Extensive qualitative and quantitative experiments have been conducted to validate the superiority of our method in single-panorama indoor novel synthesis compared to the state-of-the-art. Our code and data are available at this https URL.

Title: Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory

Authors: Simon Münker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11415
Pdf URL: https://arxiv.org/pdf/2408.11415
Copy Paste: [[2408.11415]] Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory(https://arxiv.org/abs/2408.11415)
Keywords: generative, in-context
Abstract: Contemporary research in social sciences is increasingly utilizing state-of-the-art statistical language models to annotate or generate content. While these models perform benchmark-leading on common language tasks and show exemplary task-independent emergent abilities, transferring them to novel out-of-domain tasks is only insufficiently explored. The implications of the statistical black-box approach - stochastic parrots - are prominently criticized in the language model research community; however, the significance for novel generative tasks is not. This work investigates the alignment between personalized language models and survey participants on a Moral Foundation Theory questionnaire. We adapt text-to-text models to different political personas and survey the questionnaire repetitively to generate a synthetic population of persona and model combinations. Analyzing the intra-group variance and cross-alignment shows significant differences across models and personas. Our findings indicate that adapted models struggle to represent the survey-captured assessment of political ideologies. Thus, using language models to mimic social interactions requires measurable improvements in in-context optimization or parameter manipulation to align with psychological and sociological stereotypes. Without quantifiable alignment, generating politically nuanced content remains unfeasible. To enhance these representations, we propose a testable framework to generate agents based on moral value statements for future research.

Title: T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

Authors: Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11432
Pdf URL: https://arxiv.org/pdf/2408.11432
Copy Paste: [[2408.11432]] T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval(https://arxiv.org/abs/2408.11432)
Keywords: generative
Abstract: Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost and will increase notably with the increase of candidates. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at this https URL.

Title: GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Authors: Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, Naoto Yokoya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11447
Pdf URL: https://arxiv.org/pdf/2408.11447
Copy Paste: [[2408.11447]] GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting(https://arxiv.org/abs/2408.11447)
Keywords: self-supervised
Abstract: We introduce GaussianOcc, a systematic method that investigates the two usages of Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D poses from sensors during training. To address this limitation, we propose a Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps, semantic maps), which is both time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground truth pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering).

Title: MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation

Authors: Kim Yu-Ji, Hyunwoo Ha, Kim Youwang, Jaeheung Surh, Hyowon Ha, Tae-Hyun Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11465
Pdf URL: https://arxiv.org/pdf/2408.11465
Copy Paste: [[2408.11465]] MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation(https://arxiv.org/abs/2408.11465)
Keywords: generative
Abstract: Reconstructing 3D from a single view image is a long-standing challenge. One of the popular approaches to tackle this problem is learning-based methods, but dealing with the test cases unfamiliar with training data (Out-of-distribution; OoD) introduces an additional challenge. To adapt for unseen samples in test time, we propose MeTTA, a test-time adaptation (TTA) exploiting generative prior. We design joint optimization of 3D geometry, appearance, and pose to handle OoD cases with only a single view image. However, the alignment between the reference image and the 3D shape via the estimated viewpoint could be erroneous, which leads to ambiguity. To address this ambiguity, we carefully design learnable virtual cameras and their self-calibration. In our experiments, we demonstrate that MeTTA effectively deals with OoD scenarios at failure cases of existing learning-based 3D reconstruction models and enables obtaining a realistic appearance with physically based rendering (PBR) textures.

Title: TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Authors: Haitao Zhou, Chuang Wang, Rui Nie, Jinxiao Lin, Dongdong Yu, Qian Yu, Changhu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11475
Pdf URL: https://arxiv.org/pdf/2408.11475
Copy Paste: [[2408.11475]] TrackGo: A Flexible and Efficient Method for Controllable Video Generation(https://arxiv.org/abs/2408.11475)
Keywords: diffusion
Abstract: Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention map of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores. The project page of TrackGo can be found at: this https URL

Title: Imagining from Images with an AI Storytelling Tool

Authors: Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11517
Pdf URL: https://arxiv.org/pdf/2408.11517
Copy Paste: [[2408.11517]] Imagining from Images with an AI Storytelling Tool(https://arxiv.org/abs/2408.11517)
Keywords: diffusion
Abstract: A method for generating narratives by analyzing single images or image sequences is presented, inspired by the time immemorial tradition of Narrative Art. The proposed method explores the multimodal capabilities of GPT-4o to interpret visual content and create engaging stories, which are illustrated by a Stable Diffusion XL model. The method is supported by a fully implemented tool, called ImageTeller, which accepts images from diverse sources as input. Users can guide the narrative's development according to the conventions of fundamental genres - such as Comedy, Romance, Tragedy, Satire or Mystery -, opt to generate data-driven stories, or to leave the prototype free to decide how to handle the narrative structure. User interaction is provided along the generation process, allowing the user to request alternative chapters or illustrations, and even reject and restart the story generation based on the same input. Additionally, users can attach captions to the input images, influencing the system's interpretation of the visual content. Examples of generated stories are provided, along with details on how to access the prototype.

Title: Just Project! Multi-Channel Despeckling, the Easy Way

Authors: Loïc Denis, Emanuele Dalsasso (EPFL), Florence Tupin (IMAGES, IDS)
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2408.11531
Pdf URL: https://arxiv.org/pdf/2408.11531
Copy Paste: [[2408.11531]] Just Project! Multi-Channel Despeckling, the Easy Way(https://arxiv.org/abs/2408.11531)
Keywords: self-supervised
Abstract: Reducing speckle fluctuations in multi-channel SAR images is essential in many applications of SAR imaging such as polarimetric classification or interferometric height estimation. While single-channel despeckling has widely benefited from the application of deep learning techniques, extensions to multi-channel SAR images are much more challenging. This paper introduces MuChaPro, a generic framework that exploits existing single-channel despeckling methods. The key idea is to generate numerous single-channel projections, restore these projections, and recombine them into the final multi-channel estimate. This simple approach is shown to be effective in polarimetric and/or interferometric modalities. A special appeal of MuChaPro is the possibility to apply a self-supervised training strategy to learn sensor-specific networks for single-channel despeckling.

Title: Memorization In In-Context Learning

Authors: Shahriar Golchin, Mihai Surdeanu, Steven Bethard, Eduardo Blanco, Ellen Riloff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11546
Pdf URL: https://arxiv.org/pdf/2408.11546
Copy Paste: [[2408.11546]] Memorization In In-Context Learning(https://arxiv.org/abs/2408.11546)
Keywords: in-context
Abstract: In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind these performance improvements remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers a hidden phenomenon -- memorization -- at the core of ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?

Title: AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Authors: Yunfang Niu, Lingxiang Wu, Dong Yi, Jie Peng, Ning Jiang, Haiying Wu, Jinqiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11553
Pdf URL: https://arxiv.org/pdf/2408.11553
Copy Paste: [[2408.11553]] AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion(https://arxiv.org/abs/2408.11553)
Keywords: diffusion
Abstract: Fashion image editing aims to modify a person's appearance based on a given instruction. Existing methods require auxiliary tools like segmenters and keypoint extractors, lacking a flexible and unified framework. Moreover, these methods are limited in the variety of clothing types they can handle, as most datasets focus on people in clean backgrounds and only include generic garments such as tops, pants, and dresses. These limitations restrict their applicability in real-world scenarios. In this paper, we first extend an existing dataset for human generation to include a wider range of apparel and more complex backgrounds. This extended dataset features people wearing diverse items such as tops, pants, dresses, skirts, headwear, scarves, shoes, socks, and bags. Additionally, we propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas. Users can simply input a human image along with a corresponding prompt in either text or image format. Our approach incorporates Fashion DiT, equipped with a Fashion-Guidance Attention (FGA) module designed to fuse explicit apparel types and CLIP-encoded apparel features. Both qualitative and quantitative experiments demonstrate that our method delivers high-quality fashion editing and outperforms contemporary text-guided fashion editing methods.

Title: Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Authors: Duc-Hai Pham, Duc Dung Nguyen, Hoang-Anh Pham, Ho Lai Tuan, Phong Ha Nguyen, Khoi Nguyen, Rang Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11559
Pdf URL: https://arxiv.org/pdf/2408.11559
Copy Paste: [[2408.11559]] Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance(https://arxiv.org/abs/2408.11559)
Keywords: foundation model
Abstract: Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.

Title: Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control

Authors: Muhammad Aqeel, Shakiba Sharifi, Marco Cristani, Francesco Setti
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11561
Pdf URL: https://arxiv.org/pdf/2408.11561
Copy Paste: [[2408.11561]] Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control(https://arxiv.org/abs/2408.11561)
Keywords: self-supervised, anomaly
Abstract: This study introduces the Iterative Refinement Process (IRP), a robust anomaly detection methodology designed for high-stakes industrial quality control. The IRP enhances defect detection accuracy through a cyclic data refinement strategy, iteratively removing misleading data points to improve model performance and robustness. We validate the IRP's effectiveness using two benchmark datasets, Kolektor SDD2 (KSDD2) and MVTec AD, covering a wide range of industrial products and defect types. Our experimental results demonstrate that the IRP consistently outperforms traditional anomaly detection models, particularly in environments with high noise levels. This study highlights the IRP's potential to significantly enhance anomaly detection processes in industrial settings, effectively managing the challenges of sparse and noisy data.

Title: AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Authors: Minheng Ni, Chenfei Wu, Huaying Yuan, Zhengyuan Yang, Ming Gong, Lijuan Wang, Zicheng Liu, Wangmeng Zuo, Nan Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11564
Pdf URL: https://arxiv.org/pdf/2408.11564
Copy Paste: [[2408.11564]] AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition(https://arxiv.org/abs/2408.11564)
Keywords: generative
Abstract: With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism. However, the approach to generate multi-sensory outputs has not been fully explored, limiting its application to high-value scenarios such as directing a film. Developing a movie director agent faces two major challenges: (1) Lack of parallelism and online scheduling with production steps: In the production of multi-sensory films, there are complex dependencies between different sensory elements, and the production time for each element varies. (2) Diverse needs and clear communication demands with users: Users often cannot clearly express their needs until they see a draft, which requires human-computer interaction and iteration to continually adjust and optimize the film content based on user feedback. To address these issues, we introduce AutoDirector, an interactive multi-sensory composition framework that supports long shots, special effects, music scoring, dubbing, and lip-syncing. This framework improves the efficiency of multi-sensory film production through automatic scheduling and supports the modification and improvement of interactive tasks to meet user needs. AutoDirector not only expands the application scope of human-machine collaboration but also demonstrates the potential of AI in collaborating with humans in the role of a film director to complete multi-sensory films.

Title: Robust 3D Gaussian Splatting for Novel View Synthesis in Presence of Distractors

Authors: Paul Ungermann, Armin Ettenhofer, Matthias Nießner, Barbara Roessle
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11697
Pdf URL: https://arxiv.org/pdf/2408.11697
Copy Paste: [[2408.11697]] Robust 3D Gaussian Splatting for Novel View Synthesis in Presence of Distractors(https://arxiv.org/abs/2408.11697)
Keywords: self-supervised
Abstract: 3D Gaussian Splatting has shown impressive novel view synthesis results; nonetheless, it is vulnerable to dynamic objects polluting the input data of an otherwise static scene, so called distractors. Distractors have severe impact on the rendering quality as they get represented as view-dependent effects or result in floating artifacts. Our goal is to identify and ignore such distractors during the 3D Gaussian optimization to obtain a clean reconstruction. To this end, we take a self-supervised approach that looks at the image residuals during the optimization to determine areas that have likely been falsified by a distractor. In addition, we leverage a pretrained segmentation network to provide object awareness, enabling more accurate exclusion of distractors. This way, we obtain segmentation masks of distractors to effectively ignore them in the loss formulation. We demonstrate that our approach is robust to various distractors and strongly improves rendering quality on distractor-polluted scenes, improving PSNR by 1.86dB compared to 3D Gaussian Splatting.
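
As a reading aid for the reported 1.86 dB gain, the following is a minimal sketch of the standard PSNR definition and of what a given dB gain implies for mean squared error. It is not the authors' evaluation code; the function names are illustrative, and images are assumed to be flat lists of intensities in [0, max_value].

```python
import math


def mse(reference, rendered):
    """Mean squared error between two equally sized, non-empty pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)


def psnr(reference, rendered, max_value=1.0):
    """Standard PSNR in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(reference, rendered)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # Under this definition, a 1.86 dB gain means the MSE shrinks
    # to 10^(-0.186) of its previous value.
    print(round(mse_ratio_for_gain(1.86), 3))
```

Because PSNR is logarithmic in the error, a fixed dB gain corresponds to the same relative reduction in MSE regardless of the baseline quality.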

Title: FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohan Sai Singamsetti, Fengyu Sun, Wei Lu, Di Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11706
Pdf URL: https://arxiv.org/pdf/2408.11706
Copy Paste: [[2408.11706]] FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting(https://arxiv.org/abs/2408.11706)
Keywords: diffusion
Abstract: Text-to-image (T2I) diffusion models have demonstrated impressive capabilities in generating high-quality images given a text prompt. However, ensuring the prompt-image alignment remains a considerable challenge, i.e., generating images that faithfully align with the prompt's semantics. Recent works attempt to improve the faithfulness by optimizing the latent code, which potentially could cause the latent code to go out-of-distribution and thus produce unrealistic images. In this paper, we propose FRAP, a simple, yet effective approach based on adaptively adjusting the per-token prompt weights to improve prompt-image alignment and authenticity of the generated images. We design an online algorithm to adaptively update each token's weight coefficient, which is achieved by minimizing a unified objective function that encourages object presence and the binding of object-modifier pairs. Through extensive evaluations, we show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets, while having a lower average latency compared to recent latent code optimization methods, e.g., 4 seconds faster than D&B on the COCO-Subject dataset. Furthermore, through visual comparisons and evaluation on the CLIP-IQA-Real metric, we show that FRAP not only improves prompt-image alignment but also generates more authentic images with realistic appearances. We also explore combining FRAP with prompt rewriting LLM to recover their degraded prompt-image alignment, where we observe improvements in both prompt-image alignment and image quality.

Title: Iterative Object Count Optimization for Text-to-image Diffusion Models

Authors: Oz Zafar, Lior Wolf, Idan Schwartz
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11721
Pdf URL: https://arxiv.org/pdf/2408.11721
Copy Paste: [[2408.11721]] Iterative Object Count Optimization for Text-to-image Diffusion Models(https://arxiv.org/abs/2408.11721)
Keywords: diffusion
Abstract: We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at this https URL.

Title: Sum of Squares Circuits

Authors: Lorenzo Loconte, Stefan Mengel, Antonio Vergari
Subjects: cs.LG, cs.AI, cs.CC, math.AG
Abstract URL: https://arxiv.org/abs/2408.11778
Pdf URL: https://arxiv.org/pdf/2408.11778
Copy Paste: [[2408.11778]] Sum of Squares Circuits(https://arxiv.org/abs/2408.11778)
Keywords: generative
Abstract: Designing expressive generative models that support exact and efficient inference is a core question in probabilistic ML. Probabilistic circuits (PCs) offer a framework where this tractability-vs-expressiveness trade-off can be analyzed theoretically. Recently, squared PCs encoding subtractive mixtures via negative parameters have emerged as tractable models that can be exponentially more expressive than monotonic PCs, i.e., PCs with positive parameters only. In this paper, we provide a more precise theoretical characterization of the expressiveness relationships among these models. First, we prove that squared PCs can be less expressive than monotonic ones. Second, we formalize a novel class of PCs -- sum of squares PCs -- that can be exponentially more expressive than both squared and monotonic PCs. Around sum of squares PCs, we build an expressiveness hierarchy that allows us to precisely unify and separate different tractable model classes such as Born Machines and PSD models, and other recently introduced tractable probabilistic models by using complex parameters. Finally, we empirically show the effectiveness of sum of squares circuits in performing distribution estimation.
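
The model classes named in this abstract can be written schematically as follows. This is only a sketch in assumed notation (the symbols $c$, $c_i$, $r$, and $Z$ are not taken from the abstract): $c$ denotes a circuit whose parameters may be negative, and $Z$ is the normalizing constant.

```latex
% Monotonic PC: non-negative parameters, so the output is already non-negative.
p_{\mathrm{mono}}(x) = \frac{c_{+}(x)}{Z}, \qquad c_{+}(x) \ge 0 \ \text{by construction}

% Squared PC: real (possibly negative) parameters, squared to ensure non-negativity.
p_{\mathrm{sq}}(x) = \frac{c(x)^{2}}{Z}, \qquad Z = \int c(x)^{2}\,dx

% Sum of squares PC: a sum of r squared circuits.
p_{\mathrm{sos}}(x) = \frac{1}{Z}\sum_{i=1}^{r} c_{i}(x)^{2}, \qquad Z = \int \sum_{i=1}^{r} c_{i}(x)^{2}\,dx
```

Setting $r = 1$ recovers the squared form, which is consistent with the abstract's placement of sum of squares PCs above squared PCs in the expressiveness hierarchy.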

Title: Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

Authors: Haipeng Zhou, Honqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11785
Pdf URL: https://arxiv.org/pdf/2408.11785
Copy Paste: [[2408.11785]] Timeline and Boundary Guided Diffusion Network for Video Shadow Detection(https://arxiv.org/abs/2408.11785)
Keywords: diffusion
Abstract: Video Shadow Detection (VSD) aims to detect the shadow masks with frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidance and boundary information jointly. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames for the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize the edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD in which we explore a Space-Time Encoded Embedding (STEE) to inject the temporal guidance for Diffusion to conduct shadow detection. Benefiting from these designs, our model can not only capture the temporal information but also the shadow property. Extensive experiments show that the performance of our approach overtakes the state-of-the-art methods, verifying the effectiveness of our components. We release the codes, weights, and results at this https URL.

Title: Practical token pruning for foundation models in few-shot conversational virtual assistant systems

Authors: Haode Qi, Cheng Qian, Jian Ni, Pratyush Singh, Reza Fazeli, Gengyu Wang, Zhongzheng Shu, Eric Wayne, Juergen Bross
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11799
Pdf URL: https://arxiv.org/pdf/2408.11799
Copy Paste: [[2408.11799]] Practical token pruning for foundation models in few-shot conversational virtual assistant systems(https://arxiv.org/abs/2408.11799)
Keywords: foundation model
Abstract: In an enterprise Virtual Assistant (VA) system, intent classification is the crucial component that determines how a user input is handled based on what the user wants. The VA system is expected to be a cost-efficient SaaS service with low training and inference time while achieving high accuracy even with a small number of training samples. We pretrain a transformer-based sentence embedding model with a contrastive learning objective and leverage the embeddings of the model as features when training intent classification models. Our approach achieves state-of-the-art results for few-shot scenarios and performs better than other commercial solutions on popular intent classification benchmarks. However, generating features via a transformer-based model increases the inference time, especially for longer user inputs, due to the quadratic runtime of the transformer's attention mechanism. On top of model distillation, we introduce a practical multi-task adaptation approach that configures dynamic token pruning without the need for task-specific training for intent classification. We demonstrate that this approach improves the inference speed of popular sentence transformer models without affecting model performance.

Title: Pixel Is Not A Barrier: An Effective Evasion Attack for Pixel-Domain Diffusion Models

Authors: Chun-Yen Shih, Li-Xuan Peng, Jia-Wei Liao, Ernie Chu, Cheng-Fu Chou, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11810
Pdf URL: https://arxiv.org/pdf/2408.11810
Copy Paste: [[2408.11810]] Pixel Is Not A Barrier: An Effective Evasion Attack for Pixel-Domain Diffusion Models(https://arxiv.org/abs/2408.11810)
Keywords: diffusion, generative
Abstract: Diffusion Models have emerged as powerful generative models for high-quality image synthesis, with many subsequent image editing techniques based on them. However, the ease of text-based image editing introduces significant risks, such as malicious editing for scams or intellectual property infringement. Previous works have attempted to safeguard images from diffusion-based editing by adding imperceptible perturbations. These methods are costly and specifically target prevalent Latent Diffusion Models (LDMs), while Pixel-domain Diffusion Models (PDMs) remain largely unexplored and robust against such attacks. Our work addresses this gap by proposing a novel attacking framework with a feature representation attack loss that exploits vulnerabilities in denoising UNets and a latent optimization strategy to enhance the naturalness of protected images. Extensive experiments demonstrate the effectiveness of our approach in attacking dominant PDM-based editing methods (e.g., SDEdit) while maintaining reasonable protection fidelity and robustness against common defense methods. Additionally, our framework is extensible to LDMs, achieving comparable performance to existing approaches.

Title: EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Authors: Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2408.11811
Pdf URL: https://arxiv.org/pdf/2408.11811
Copy Paste: [[2408.11811]] EmbodiedSAM: Online Segment Any 3D Thing in Real Time(https://arxiv.org/abs/2408.11811)
Keywords: foundation model
Abstract: Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFMs) have revolutionized the field of 2D computer vision with superior performance, which makes the use of VFMs to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames, so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM as 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by efficient matrix operations, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show that our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transfer experiments and shows great potential in open-vocabulary and data-efficient settings. Code and demo are available at this https URL, with only one RTX 3090 GPU required for training and evaluation.