Copy Paste: [[2408.12747]] CatFree3D: Category-agnostic 3D Object Detection with Diffusion(https://arxiv.org/abs/2408.12747)
Keywords: diffusion
Abstract: Image-based 3D object detection is widely employed in applications such as autonomous vehicles and robotics, yet current systems struggle with generalisation due to complex problem setup and limited training data. We introduce a novel pipeline that decouples 3D detection from 2D detection and depth prediction, using a diffusion-based approach to improve accuracy and support category-agnostic detection. Additionally, we introduce the Normalised Hungarian Distance (NHD) metric for an accurate evaluation of 3D detection results, addressing the limitations of traditional IoU and GIoU metrics. Experimental results demonstrate that our method achieves state-of-the-art accuracy and strong generalisation across various object categories and datasets.
Title: Contrastive Representation Learning for Dynamic Link Prediction in Temporal Networks
Authors: Amirhossein Nouranizadeh, Fatemeh Tabatabaei Far, Mohammad Rahmati
Copy Paste: [[2408.12753]] Contrastive Representation Learning for Dynamic Link Prediction in Temporal Networks(https://arxiv.org/abs/2408.12753)
Keywords: self-supervised
Abstract: Evolving networks are complex data structures that emerge in a wide range of systems in science and engineering. Learning expressive representations for such networks that encode their structural connectivity and temporal evolution is essential for downstream data analytics and machine learning applications. In this study, we introduce a self-supervised method for learning representations of temporal networks and employ these representations in the dynamic link prediction task. While temporal networks are typically characterized as a sequence of interactions over the continuous time domain, our study focuses on their discrete-time versions. This enables us to balance the trade-off between computational complexity and precise modeling of the interactions. We propose a recurrent message-passing neural network architecture for modeling the information flow over time-respecting paths of temporal networks. The key feature of our method is the contrastive training objective of the model, which is a combination of three loss functions: link prediction, graph reconstruction, and contrastive predictive coding losses. The contrastive predictive coding objective is implemented using infoNCE losses at both local and global scales of the input graphs. We empirically show that the additional self-supervised losses enhance the training and improve the model's performance in the dynamic link prediction task. The proposed method is tested on Enron, COLAB, and Facebook datasets and exhibits superior results compared to existing models.
Title: Symmetric masking strategy enhances the performance of Masked Image Modeling
Copy Paste: [[2408.12772]] Symmetric masking strategy enhances the performance of Masked Image Modeling(https://arxiv.org/abs/2408.12772)
Keywords: self-supervised
Abstract: Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images by estimating the missing pixels in randomly masked sections. It has proven to be a powerful tool for the preliminary training of Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods heavily depend on the random masking strategy to formulate the pretext task. This strategy necessitates numerous trials to ascertain the optimal dropping ratio, which can be resource-intensive, requiring the model to be pre-trained for anywhere between 800 to 1600 epochs. Furthermore, this approach may not be suitable for all datasets. In this work, we propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, SymMIM, our proposed training pipeline for MIM is introduced. SymMIM achieves a new SOTA accuracy of 85.9\% on ImageNet using ViT-Large and surpasses previous SOTA across downstream tasks such as image classification, semantic segmentation, object detection, instance segmentation tasks, and so on.
Title: Investigating LLM Applications in E-Commerce
Copy Paste: [[2408.12779]] Investigating LLM Applications in E-Commerce(https://arxiv.org/abs/2408.12779)
Keywords: in-context
Abstract: The emergence of Large Language Models (LLMs) has revolutionized natural language processing in various applications especially in e-commerce. One crucial step before the application of such LLMs in these fields is to understand and compare the performance in different use cases in such tasks. This paper explored the efficacy of LLMs in the e-commerce domain, focusing on instruction-tuning an open source LLM model with public e-commerce datasets of varying sizes and comparing the performance with the conventional models prevalent in industrial applications. We conducted a comprehensive comparison between LLMs and traditional pre-trained language models across specific tasks intrinsic to the e-commerce domain, namely classification, generation, summarization, and named entity recognition (NER). Furthermore, we examined the effectiveness of the current niche industrial application of very large LLM, using in-context learning, in e-commerce specific tasks. Our findings indicate that few-shot inference with very large LLMs often does not outperform fine-tuning smaller pre-trained models, underscoring the importance of task-specific model optimization.Additionally, we investigated different training methodologies such as single-task training, mixed-task training, and LoRA merging both within domain/tasks and between different tasks. Through rigorous experimentation and analysis, this paper offers valuable insights into the potential effectiveness of LLMs to advance natural language processing capabilities within the e-commerce industry.
Title: Less for More: Enhancing Preference Learning in Generative Language Models with Automated Self-Curation of Training Corpora
Copy Paste: [[2408.12799]] Less for More: Enhancing Preference Learning in Generative Language Models with Automated Self-Curation of Training Corpora(https://arxiv.org/abs/2408.12799)
Keywords: generative
Abstract: Ambiguity in language presents challenges in developing more enhanced language models, particularly in preference learning, where variability among annotators results in inconsistently annotated datasets used for model alignment. To address this issue, we introduce a self-curation method that preprocesses annotated datasets by leveraging proxy models trained directly on these datasets. Our method enhances preference learning by automatically detecting and removing ambiguous annotations within the dataset. The proposed approach is validated through extensive experiments, demonstrating a marked improvement in performance across various instruction-following tasks. Our work provides a straightforward and reliable method to overcome annotation inconsistencies, serving as an initial step towards the development of more advanced preference learning techniques.
Title: Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks
Authors: Yusuf Usman, Aadesh Upadhyay, Prashnna Gyawali, Robin Chataut
Copy Paste: [[2408.12806]] Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks(https://arxiv.org/abs/2408.12806)
Keywords: generative
Abstract: In an era where digital threats are increasingly sophisticated, the intersection of Artificial Intelligence and cybersecurity presents both promising defenses and potent dangers. This paper delves into the escalating threat posed by the misuse of AI, specifically through the use of Large Language Models (LLMs). This study details various techniques like the switch method and character play method, which can be exploited by cybercriminals to generate and automate cyber attacks. Through a series of controlled experiments, the paper demonstrates how these models can be manipulated to bypass ethical and privacy safeguards to effectively generate cyber attacks such as social engineering, malicious code, payload generation, and spyware. By testing these AI generated attacks on live systems, the study assesses their effectiveness and the vulnerabilities they exploit, offering a practical perspective on the risks AI poses to critical infrastructure. We also introduce Occupy AI, a customized, finetuned LLM specifically engineered to automate and execute cyberattacks. This specialized AI driven tool is adept at crafting steps and generating executable code for a variety of cyber threats, including phishing, malware injection, and system exploitation. The results underscore the urgency for ethical AI practices, robust cybersecurity measures, and regulatory oversight to mitigate AI related threats. This paper aims to elevate awareness within the cybersecurity community about the evolving digital threat landscape, advocating for proactive defense strategies and responsible AI development to protect against emerging cyber threats.
Title: Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery
Authors: Zhenyuan Yang, Xuhui Lin, Qinyi He, Ziye Huang, Zhengliang Liu, Hanqi Jiang, Peng Shu, Zihao Wu, Yiwei Li, Stephen Law, Gengchen Mai, Tianming Liu, Tao Yang
Copy Paste: [[2408.12821]] Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery(https://arxiv.org/abs/2408.12821)
Keywords: foundation model
Abstract: The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding, but highlight limitations in detailed recognition and counting tasks. While zero-shot learning shows potential, performance varies depending on the problem domains and image complexities. This study provides new insights into the strengths and weaknesses of multimodal foundation models for practical challenges in Street View Imagery, Built Environment, and Interior. Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications at the intersection of computer vision and language.
Title: Smooth InfoMax -- Towards easier Post-Hoc interpretability
Copy Paste: [[2408.12936]] Smooth InfoMax -- Towards easier Post-Hoc interpretability(https://arxiv.org/abs/2408.12936)
Keywords: self-supervised
Abstract: We introduce Smooth InfoMax (SIM), a novel method for self-supervised representation learning that incorporates an interpretability constraint into the learned representations at various depths of the neural network. SIM's architecture is split up into probabilistic modules, each locally optimized using the InfoNCE bound. Inspired by VAEs, the representations from these modules are designed to be samples from Gaussian distributions and are further constrained to be close to the standard normal distribution. This results in a smooth and predictable space, enabling traversal of the latent space through a decoder for easier post-hoc analysis of the learned representations. We evaluate SIM's performance on sequential speech data, showing that it performs competitively with its less interpretable counterpart, Greedy InfoMax (GIM). Moreover, we provide insights into SIM's internal representations, demonstrating that the contained information is less entangled throughout the representation and more concentrated in a smaller subset of the dimensions. This further highlights the improved interpretability of SIM.
Title: Causal-Guided Active Learning for Debiasing Large Language Models
Authors: Zhouhao Sun, Li Du, Xiao Ding, Yixuan Ma, Kaitao Qiu, Ting Liu, Bing Qin
Copy Paste: [[2408.12942]] Causal-Guided Active Learning for Debiasing Large Language Models(https://arxiv.org/abs/2408.12942)
Keywords: generative, in-context
Abstract: Although achieving promising performance, recent analyses show that current generative large language models (LLMs) may still capture dataset biases and utilize them for generation, leading to poor generalizability and harmfulness of LLMs. However, due to the diversity of dataset biases and the over-optimization problem, previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. To address this issue, we explore combining active learning with the causal mechanisms and propose a casual-guided active learning (CAL) framework, which utilizes LLMs itself to automatically and autonomously identify informative biased samples and induce the bias patterns. Then a cost-effective and efficient in-context learning based method is employed to prevent LLMs from utilizing dataset biases during generation. Experimental results show that CAL can effectively recognize typical biased instances and induce various bias patterns for debiasing LLMs.
Title: State-of-the-Art Fails in the Art of Damage Detection
Authors: Daniela Ivanova, Marco Aversa, Paul Henderson, John Williamson
Copy Paste: [[2408.12953]] State-of-the-Art Fails in the Art of Damage Detection(https://arxiv.org/abs/2408.12953)
Keywords: diffusion
Abstract: Accurately detecting and classifying damage in analogue media such as paintings, photographs, textiles, mosaics, and frescoes is essential for cultural heritage preservation. While machine learning models excel in correcting global degradation if the damage operator is known a priori, we show that they fail to predict where the damage is even after supervised training; thus, reliable damage detection remains a challenge. We introduce DamBench, a dataset for damage detection in diverse analogue media, with over 11,000 annotations covering 15 damage types across various subjects and media. We evaluate CNN, Transformer, and text-guided diffusion segmentation models, revealing their limitations in generalising across media types.
Title: Image Segmentation in Foundation Model Era: A Survey
Authors: Tianfei Zhou, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, Daniel Cremers
Copy Paste: [[2408.12957]] Image Segmentation in Foundation Model Era: A Survey(https://arxiv.org/abs/2408.12957)
Keywords: diffusion, foundation model
Abstract: Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methodologies have embarked on a new epoch by either adapting FMs (e.g., CLIP, Stable Diffusion, DINO) for image segmentation or developing dedicated segmentation foundation models (e.g., SAM). These approaches not only deliver superior segmentation performance, but also herald newfound segmentation capabilities previously unseen in deep learning context. However, current research in image segmentation lacks a detailed analysis of distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation. We investigate two basic lines of research -- generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation), and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) -- by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs like CLIP, Stable Diffusion, and DINO. An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts. Subsequently, we engage in a discussion of open issues and potential avenues for future research. We envisage that this fresh, comprehensive, and systematic survey catalyzes the evolution of advanced image segmentation systems.
Abstract: The rapid growth of Large Language Models (LLMs) usage has highlighted the importance of gradient-free in-context learning (ICL). However, interpreting their inner workings remains challenging. This paper introduces a novel multimodal contrastive in-context learning framework to enhance our understanding of ICL in LLMs. First, we present a contrastive learning-based interpretation of ICL in real-world settings, marking the distance of the key-value representation as the differentiator in ICL. Second, we develop an analytical framework to address biases in multimodal input formatting for real-world datasets. We demonstrate the effectiveness of ICL examples where baseline performance is poor, even when they are represented in unseen formats. Lastly, we propose an on-the-fly approach for ICL (Anchored-by-Text ICL) that demonstrates effectiveness in detecting hateful memes, a task where typical ICL struggles due to resource limitations. Extensive experiments on multimodal datasets reveal that our approach significantly improves ICL performance across various scenarios, such as challenging tasks and resource-constrained environments. Moreover, it provides valuable insights into the mechanisms of in-context learning in LLMs. Our findings have important implications for developing more interpretable, efficient, and robust multimodal AI systems, especially in challenging tasks and resource-constrained environments.
Title: EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation
Copy Paste: [[2408.13005]] EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation(https://arxiv.org/abs/2408.13005)
Keywords: diffusion
Abstract: Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation has serious limitations, as videos contain much richer content than images, especially in terms of motion. This information can hardly be adequately described with plain text. Fortunately, in computer vision, various visual representations can serve as additional control signals to guide generation. With the help of these signals, video generation can be controlled in finer detail, allowing for greater flexibility for different applications. Integrating various controls, however, is nontrivial. In this paper, we propose a universal framework called EasyControl. By propagating and injecting condition features through condition adapters, our method enables users to control video generation with a single condition map. With our framework, various conditions including raw pixels, depth, HED, etc., can be integrated into different Unet-based pre-trained video diffusion models at a low practical cost. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. EasyControl significantly improves various evaluation metrics across multiple validation datasets compared to previous works. Specifically, for the sketch-to-video generation task, EasyControl achieves an improvement of 152.0 on FVD and 19.9 on IS, respectively, in UCF101 compared with VideoComposer. For fidelity, our model demonstrates powerful image retention ability, resulting in high FVD and IS in UCF101 and MSR-VTT compared to other image-to-video models.
Title: In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting
Copy Paste: [[2408.13028]] In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting(https://arxiv.org/abs/2408.13028)
Keywords: in-context
Abstract: In-context learning (ICL) of large language models (LLMs) has attracted increasing attention in the community where LLMs make predictions only based on instructions augmented with a few examples. Existing example selection methods for ICL utilize sparse or dense retrievers and derive effective performance. However, these methods do not utilize direct feedback of LLM to train the retriever and the examples selected can not necessarily improve the analogy ability of LLM. To tackle this, we propose our policy-based reinforcement learning framework for example selection (RLS), which consists of a language model (LM) selector and an LLM generator. The LM selector encodes the candidate examples into dense representations and selects the top-k examples into the demonstration for LLM. The outputs of LLM are adopted to compute the reward and policy gradient to optimize the LM selector. We conduct experiments on different datasets and significantly outperform existing example selection methods. Moreover, our approach shows advantages over supervised finetuning (SFT) models in few shot setting. Further experiments show the balance of abundance and the similarity with the test case of examples is important for ICL performance of LLM.
Title: VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models
Copy Paste: [[2408.13031]] VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models(https://arxiv.org/abs/2408.13031)
Keywords: foundation model
Abstract: Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images based on a pre-trained backbone (e.g., ResNet, ViT). Some researchers also exploit and enhance the detection performance using pre-trained large foundation models. However, we think these detectors may only get sub-optimal results because the large models they use are not specifically designed for vehicles. In addition, their results heavily rely on visual features, and seldom of they consider the alignment between the vehicle's semantic information and visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework and the features of each proposal can be enhanced using VehicleMAE. More importantly, we propose a new VAtt2Vec module that predicts the vehicle semantic attributes of these proposals and transforms them into feature vectors to enhance the vision features via contrastive learning. Extensive experiments on three vehicle detection benchmark datasets thoroughly proved the effectiveness of our vehicle detector. Specifically, our model improves the baseline approach by $+5.1\%$, $+6.2\%$ on the $AP_{0.5}$, $AP_{0.75}$ metrics, respectively, on the Cityscapes dataset.The source code of this work will be released at this https URL.
Title: G3FA: Geometry-guided GAN for Face Animation
Copy Paste: [[2408.13049]] G3FA: Geometry-guided GAN for Face Animation(https://arxiv.org/abs/2408.13049)
Keywords: generative
Abstract: Animating human face images aims to synthesize a desired source identity in a natural-looking way mimicking a driving video's facial movements. In this context, Generative Adversarial Networks have demonstrated remarkable potential in real-time face reenactment using a single source image, yet are constrained by limited geometry consistency compared to graphic-based approaches. In this paper, we introduce Geometry-guided GAN for Face Animation (G3FA) to tackle this limitation. Our novel approach empowers the face animation model to incorporate 3D information using only 2D images, improving the image generation capabilities of the talking head synthesis model. We integrate inverse rendering techniques to extract 3D facial geometry properties, improving the feedback loop to the generator through a weighted average ensemble of discriminators. In our face reenactment model, we leverage 2D motion warping to capture motion dynamics along with orthogonal ray sampling and volume rendering techniques to produce the ultimate visual output. To evaluate the performance of our G3FA, we conducted comprehensive experiments using various evaluation protocols on VoxCeleb2 and TalkingHead benchmarks to demonstrate the effectiveness of our proposed framework compared to the state-of-the-art real-time face animation methods.
Title: Atlas Gaussians Diffusion for 3D Generation with Infinite Number of Points
Copy Paste: [[2408.13055]] Atlas Gaussians Diffusion for 3D Generation with Infinite Number of Points(https://arxiv.org/abs/2408.13055)
Keywords: diffusion
Abstract: Using the latent diffusion model has proven effective in developing novel 3D generation techniques. To harness the latent diffusion model, a key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space. In this paper, we introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation. Atlas Gaussians represent a shape as the union of local patches, and each patch can decode 3D Gaussians. We parameterize a patch as a sequence of feature vectors and design a learnable function to decode 3D Gaussians from the feature vectors. In this process, we incorporate UV-based sampling, enabling the generation of a sufficiently large, and theoretically infinite, number of 3D Gaussian points. The large amount of 3D Gaussians enables high-quality details of generation results. Moreover, due to local awareness of the representation, the transformer-based decoding procedure operates on a patch level, ensuring efficiency. We train a variational autoencoder to learn the Atlas Gaussians representation, and then apply a latent diffusion model on its latent space for learning 3D Generation. Experiments show that our approach outperforms the prior arts of feed-forward native 3D generation.
Title: Multivariate Time-Series Anomaly Detection based on Enhancing Graph Attention Networks with Topological Analysis
Copy Paste: [[2408.13082]] Multivariate Time-Series Anomaly Detection based on Enhancing Graph Attention Networks with Topological Analysis(https://arxiv.org/abs/2408.13082)
Keywords: anomaly
Abstract: Unsupervised anomaly detection in time series is essential in industrial applications, as it significantly reduces the need for manual intervention. Multivariate time series pose a complex challenge due to their feature and temporal dimensions. Traditional methods use Graph Neural Networks (GNNs) or Transformers to analyze spatial while RNNs to model temporal dependencies. These methods focus narrowly on one dimension or engage in coarse-grained feature extraction, which can be inadequate for large datasets characterized by intricate relationships and dynamic changes. This paper introduces a novel temporal model built on an enhanced Graph Attention Network (GAT) for multivariate time series anomaly detection called TopoGDN. Our model analyzes both time and feature dimensions from a fine-grained perspective. First, we introduce a multi-scale temporal convolution module to extract detailed temporal features. Additionally, we present an augmented GAT to manage complex inter-feature dependencies, which incorporates graph topology into node features across multiple scales, a versatile, plug-and-play enhancement that significantly boosts the performance of GAT. Our experimental results confirm that our approach surpasses the baseline models on four datasets, demonstrating its potential for widespread application in fields requiring robust anomaly detection. The code is available at this https URL.
Title: Diffusion-based Episodes Augmentation for Offline Multi-Agent Reinforcement Learning
Authors: Jihwan Oh, Sungnyun Kim, Gahee Kim, Sunghwan Kim, Se-Young Yun
Abstract: Offline multi-agent reinforcement learning (MARL) is increasingly recognized as crucial for effectively deploying RL algorithms in environments where real-time interaction is impractical, risky, or costly. In the offline setting, learning from a static dataset of past interactions allows for the development of robust and safe policies without the need for live data collection, which can be fraught with challenges. Building on this foundational importance, we present EAQ, Episodes Augmentation guided by Q-total loss, a novel approach for offline MARL framework utilizing diffusion models. EAQ integrates the Q-total function directly into the diffusion model as a guidance to maximize the global returns in an episode, eliminating the need for separate training. Our focus primarily lies on cooperative scenarios, where agents are required to act collectively towards achieving a shared goal-essentially, maximizing global returns. Consequently, we demonstrate that our episodes augmentation in a collaborative manner significantly boosts offline MARL algorithm compared to the original dataset, improving the normalized return by +17.3% and +12.9% for medium and poor behavioral policies in SMAC simulator, respectively.
Title: IFH: a Diffusion Framework for Flexible Design of Graph Generative Models
Authors: Samuel Cognolato, Alessandro Sperduti, Luciano Serafini
Copy Paste: [[2408.13194]] IFH: a Diffusion Framework for Flexible Design of Graph Generative Models(https://arxiv.org/abs/2408.13194)
Keywords: diffusion, generative
Abstract: Graph generative models can be classified into two prominent families: one-shot models, which generate a graph in one go, and sequential models, which generate a graph by successive additions of nodes and edges. Ideally, between these two extreme models lies a continuous range of models that adopt different levels of sequentiality. This paper proposes a graph generative model, called Insert-Fill-Halt (IFH), that supports the specification of a sequentiality degree. IFH is based upon the theory of Denoising Diffusion Probabilistic Models (DDPM), designing a node removal process that gradually destroys a graph. An insertion process learns to reverse this removal process by inserting arcs and nodes according to the specified sequentiality degree. We evaluate the performance of IFH in terms of quality, run time, and memory, depending on different sequentiality degrees. We also show that using DiGress, a diffusion-based one-shot model, as a generative step in IFH leads to improvement to the model itself, and is competitive with the current state-of-the-art.
Title: CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Authors: Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li
Copy Paste: [[2408.13239]] CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities(https://arxiv.org/abs/2408.13239)
Keywords: diffusion
Abstract: Customized video generation aims to generate high-quality videos guided by text prompts and subject's reference images. However, since it is only trained on static images, the fine-tuning process of subject learning disrupts abilities of video diffusion models (VDMs) to combine concepts and generate motions. To restore these abilities, some methods use additional video similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and conceptual combination abilities without additional video and fine-tuning to recovery. For preserving conceptual combination ability, we design a plug-and-play module to update few parameters in VDMs, enhancing the model's ability to capture the appearance details and the ability of concept combinations for new subjects. For motion generation, we observed that VDMs tend to restore the motion of video in the early stage of denoising, while focusing on the recovery of subject details in the later stage. Therefore, we propose Dynamic Weighted Video Sampling Strategy. Using the pluggability of our subject learning modules, we reduce the impact of this module on motion generation in the early stage of denoising, preserving the ability to generate motion of VDMs. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method has a significant improvement compared to previous methods.
Title: LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation
Authors: Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Yixuan Li, Gordon Wetzstein, Ziwei Liu, Dahua Lin
Copy Paste: [[2408.13252]] LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation(https://arxiv.org/abs/2408.13252)
Keywords: diffusion
Abstract: 3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for free exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representation to represent large FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) we introduce a novel text-guided anchor view synthesis pipeline for high-quality, consistent panorama generation. 2) We pioneer the Layered 3D Panorama as underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scene in both full view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications.