2024-11-08

Title: DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation

Authors: Hao Phung, Quan Dao, Trung Dao, Hoang Phan, Dimitris Metaxas, Anh Tran
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04168
Pdf URL: https://arxiv.org/pdf/2411.04168
Copy Paste: [[2411.04168]] DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation(https://arxiv.org/abs/2411.04168)
Keywords: diffusion
Abstract: We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models which is essential for the details and overall quality of image generation. Besides, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs. The codes and pretrained models are released at this https URL.

Title: Quantum Diffusion Models for Few-Shot Learning

Authors: Ruhan Wang, Ye Wang, Jing Liu, Toshiaki Koike-Akino
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04217
Pdf URL: https://arxiv.org/pdf/2411.04217
Copy Paste: [[2411.04217]] Quantum Diffusion Models for Few-Shot Learning(https://arxiv.org/abs/2411.04217)
Keywords: diffusion
Abstract: Modern quantum machine learning (QML) methods involve the variational optimization of parameterized quantum circuits on training datasets, followed by predictions on testing datasets. Most state-of-the-art QML algorithms currently lack practical advantages due to their limited learning capabilities, especially in few-shot learning tasks. In this work, we propose three new frameworks employing quantum diffusion model (QDM) as a solution for the few-shot learning: label-guided generation inference (LGGI); label-guided denoising inference (LGDI); and label-guided noise addition inference (LGNAI). Experimental results demonstrate that our proposed algorithms significantly outperform existing methods.

Title: PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Authors: Siddharth Seth, Rishabh Dabral, Diogo Luvizon, Marc Habermann, Ming-Hsuan Yang, Christian Theobalt, Adam Kortylewski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04249
Pdf URL: https://arxiv.org/pdf/2411.04249
Copy Paste: [[2411.04249]] PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing(https://arxiv.org/abs/2411.04249)
Keywords: diffusion, generative
Abstract: Modeling a human avatar that can plausibly deform to articulations is an active area of research. We present PocoLoco -- the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations. Consequently, they are restricted to modeling clothing that is topologically similar to the naked body and do not extend well to loose clothing. The few methods that attempt to model loose clothing typically require either canonicalization or a UV-parameterization and need to address the challenging problem of explicitly estimating correspondences for the deforming clothes. In this work, we formulate avatar clothing deformation as a conditional point-cloud generation task within the denoising diffusion framework. Crucially, our framework operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template. This also enables a variety of practical applications, such as point-cloud completion and pose-based editing -- important features for virtual human animation. As current datasets for human avatars in loose clothing are far too small for training diffusion models, we release a dataset of two subjects performing various poses in loose clothing with a total of 75K point clouds. By contributing towards tackling the challenging task of effectively modeling loose clothing and expanding the available data for training these models, we aim to set the stage for further innovation in digital humans. The source code is available at this https URL .

Title: Generative Discrete Event Process Simulation for Hidden Markov Models to Predict Competitor Time-to-Market

Authors: Nandakishore Santhi, Stephan Eidenbenz, Brian Key, George Tompkins
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.04266
Pdf URL: https://arxiv.org/pdf/2411.04266
Copy Paste: [[2411.04266]] Generative Discrete Event Process Simulation for Hidden Markov Models to Predict Competitor Time-to-Market(https://arxiv.org/abs/2411.04266)
Keywords: generative
Abstract: We study the challenge of predicting the time at which a competitor product, such as a novel high-capacity EV battery or a new car model, will be available to customers; as new information is obtained, this time-to-market estimate is revised. Our scenario is as follows: We assume that the product is under development at a Firm B, which is a competitor to Firm A; as they are in the same industry, Firm A has a relatively good understanding of the processes and steps required to produce the product. While Firm B tries to keep its activities hidden (think of stealth-mode for start-ups), Firm A is nevertheless able to gain periodic insights by observing what type of resources Firm B is using. We show how Firm A can build a model that predicts when Firm B will be ready to sell its product; the model leverages knowledge of the underlying processes and required resources to build a Parallel Discrete Simulation (PDES)-based process model that it then uses as a generative model to train a Hidden Markov Model (HMM). We study the question of how many resource observations Firm A requires in order to accurately assess the current state of development at Firm B. In order to gain general insights into the capabilities of this approach, we study the effect of different process graph densities, different densities of the resource-activity maps, etc., and also scaling properties as we increase the number resource counts. We find that in most cases, the HMM achieves a prediction accuracy of 70 to 80 percent after 20 (daily) observations of a production process that lasts 150 days on average and we characterize the effects of different problem instance densities on this prediction accuracy. Our results give insight into the level of market knowledge required for accurate and early time-to-market prediction.

Title: Enhancing Security Control Production With Generative AI

Authors: Chen Ling, Mina Ghashami, Vianne Gao, Ali Torkamani, Ruslan Vaulin, Nivedita Mangam, Bhavya Jain, Farhan Diwan, Malini SS, Mingrui Cheng, Shreya Tarur Kumar, Felix Candelario
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.04284
Pdf URL: https://arxiv.org/pdf/2411.04284
Copy Paste: [[2411.04284]] Enhancing Security Control Production With Generative AI(https://arxiv.org/abs/2411.04284)
Keywords: generative, in-context
Abstract: Security controls are mechanisms or policies designed for cloud based services to reduce risk, protect information, and ensure compliance with security regulations. The development of security controls is traditionally a labor-intensive and time-consuming process. This paper explores the use of Generative AI to accelerate the generation of security controls. We specifically focus on generating Gherkin codes which are the domain-specific language used to define the behavior of security controls in a structured and understandable format. By leveraging large language models and in-context learning, we propose a structured framework that reduces the time required for developing security controls from 2-3 days to less than one minute. Our approach integrates detailed task descriptions, step-by-step instructions, and retrieval-augmented generation to enhance the accuracy and efficiency of the generated Gherkin code. Initial evaluations on AWS cloud services demonstrate promising results, indicating that GenAI can effectively streamline the security control development process, thus providing a robust and dynamic safeguard for cloud-based infrastructures.

Title: Efficient Symmetry-Aware Materials Generation via Hierarchical Generative Flow Networks

Authors: Tri Minh Nguyen, Sherif Abdulkader Tawfik, Truyen Tran, Sunil Gupta, Santu Rana, Svetha Venkatesh
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2411.04323
Pdf URL: https://arxiv.org/pdf/2411.04323
Copy Paste: [[2411.04323]] Efficient Symmetry-Aware Materials Generation via Hierarchical Generative Flow Networks(https://arxiv.org/abs/2411.04323)
Keywords: diffusion, generative
Abstract: Discovering new solid-state materials requires rapidly exploring the vast space of crystal structures and locating stable regions. Generating stable materials with desired properties and compositions is extremely difficult as we search for very small isolated pockets in the exponentially many possibilities, considering elements from the periodic table and their 3D arrangements in crystal lattices. Materials discovery necessitates both optimized solution structures and diversity in the generated material structures. Existing methods struggle to explore large material spaces and generate diverse samples with desired properties and requirements. We propose the Symmetry-aware Hierarchical Architecture for Flow-based Traversal (SHAFT), a novel generative model employing a hierarchical exploration strategy to efficiently exploit the symmetry of the materials space to generate crystal structures given desired properties. In particular, our model decomposes the exponentially large materials space into a hierarchy of subspaces consisting of symmetric space groups, lattice parameters, and atoms. We demonstrate that SHAFT significantly outperforms state-of-the-art iterative generative methods, such as Generative Flow Networks (GFlowNets) and Crystal Diffusion Variational AutoEncoders (CDVAE), in crystal structure generation tasks, achieving higher validity, diversity, and stability of generated structures optimized for target properties and requirements.

Title: HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images

Authors: Zhenyue Qin, Yiqun Zhang, Yang Liu, Dylan Campbell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04332
Pdf URL: https://arxiv.org/pdf/2411.04332
Copy Paste: [[2411.04332]] HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images(https://arxiv.org/abs/2411.04332)
Keywords: diffusion, generative
Abstract: Generative text-to-image models, such as Stable Diffusion, have demonstrated a remarkable ability to generate diverse, high-quality images. However, they are surprisingly inept when it comes to rendering human hands, which are often anatomically incorrect or reside in the "uncanny valley". In this paper, we propose a method HandCraft for restoring such malformed hands. This is achieved by automatically constructing masks and depth images for hands as conditioning signals using a parametric model, allowing a diffusion-based image editor to fix the hand's anatomy and adjust its pose while seamlessly integrating the changes into the original image, preserving pose, color, and style. Our plug-and-play hand restoration solution is compatible with existing pretrained diffusion models, and the restoration process facilitates adoption by eschewing any fine-tuning or training requirements for the diffusion models. We also contribute MalHand datasets that contain generated images with a wide variety of malformed hands in several styles for hand detector training and hand restoration benchmarking, and demonstrate through qualitative and quantitative evaluation that HandCraft not only restores anatomical correctness but also maintains the integrity of the overall image.

Title: GazeGen: Gaze-Driven User Interaction for Visual Content Generation

Authors: He-Yen Hsieh, Ziyun Li, Sai Qian Zhang, Wei-Te Mark Ting, Kao-Den Chang, Barbara De Salvo, Chiao Liu, H. T. Kung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04335
Pdf URL: https://arxiv.org/pdf/2411.04335
Copy Paste: [[2411.04335]] GazeGen: Gaze-Driven User Interaction for Visual Content Generation(https://arxiv.org/abs/2411.04335)
Keywords: generative
Abstract: We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface material changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.

Title: MegaPortrait: Revisiting Diffusion Control for High-fidelity Portrait Generation

Authors: Han Yang, Sotiris Anagnostidis, Enis Simsar, Thomas Hofmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04357
Pdf URL: https://arxiv.org/pdf/2411.04357
Copy Paste: [[2411.04357]] MegaPortrait: Revisiting Diffusion Control for High-fidelity Portrait Generation(https://arxiv.org/abs/2411.04357)
Keywords: diffusion
Abstract: We propose MegaPortrait. It's an innovative system for creating personalized portrait images in computer vision. It has three modules: Identity Net, Shading Net, and Harmonization Net. Identity Net generates learned identity using a customized model fine-tuned with source images. Shading Net re-renders portraits using extracted representations. Harmonization Net fuses pasted faces and the reference image's body for coherent results. Our approach with off-the-shelf Controlnets is better than state-of-the-art AI portrait products in identity preservation and image fidelity. MegaPortrait has a simple but effective design and we compare it with other methods and products to show its superiority.

Title: Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation

Authors: Vaibhav Seth, Arinjay Pathak, Ayan Sengupta, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.04358
Pdf URL: https://arxiv.org/pdf/2411.04358
Copy Paste: [[2411.04358]] Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation(https://arxiv.org/abs/2411.04358)
Keywords: generative
Abstract: Large Language Models (LLMs) are highly resource-intensive to fine-tune due to their enormous size. While low-rank adaptation is a prominent parameter-efficient fine-tuning approach, it suffers from sensitivity to hyperparameter choices, leading to instability in model performance on fine-tuning downstream tasks. This paper highlights the importance of effective parameterization in low-rank fine-tuning to reduce estimator variance and enhance the stability of final model outputs. We propose MonteCLoRA, an efficient fine-tuning technique, employing Monte Carlo estimation to learn an unbiased posterior estimation of low-rank parameters with low expected variance, which stabilizes fine-tuned LLMs with only O(1) additional parameters. MonteCLoRA shows significant improvements in accuracy and robustness, achieving up to 3.8% higher accuracy and 8.6% greater robustness than existing efficient fine-tuning methods on natural language understanding tasks with pre-trained RoBERTa-base. Furthermore, in generative tasks with pre-trained LLaMA-1-7B, MonteCLoRA demonstrates robust zero-shot performance with 50% lower variance than the contemporary efficient fine-tuning methods. The theoretical and empirical results presented in the paper underscore how parameterization and hyperpriors balance exploration-exploitation in the low-rank parametric space, therefore leading to more optimal and robust parameter estimation during efficient fine-tuning.

Title: TrajGPT: Controlled Synthetic Trajectory Generation Using a Multitask Transformer-Based Spatiotemporal Model

Authors: Shang-Ling Hsu, Emmanuel Tung, John Krumm, Cyrus Shahabi, Khurram Shafique
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.04381
Pdf URL: https://arxiv.org/pdf/2411.04381
Copy Paste: [[2411.04381]] TrajGPT: Controlled Synthetic Trajectory Generation Using a Multitask Transformer-Based Spatiotemporal Model(https://arxiv.org/abs/2411.04381)
Keywords: generative
Abstract: Human mobility modeling from GPS-trajectories and synthetic trajectory generation are crucial for various applications, such as urban planning, disaster management and epidemiology. Both of these tasks often require filling gaps in a partially specified sequence of visits - a new problem that we call "controlled" synthetic trajectory generation. Existing methods for next-location prediction or synthetic trajectory generation cannot solve this problem as they lack the mechanisms needed to constrain the generated sequences of visits. Moreover, existing approaches (1) frequently treat space and time as independent factors, an assumption that fails to hold true in real-world scenarios, and (2) suffer from challenges in accuracy of temporal prediction as they fail to deal with mixed distributions and the inter-relationships of different modes with latent variables (e.g., day-of-the-week). These limitations become even more pronounced when the task involves filling gaps within sequences instead of solely predicting the next visit. We introduce TrajGPT, a transformer-based, multi-task, joint spatiotemporal generative model to address these issues. Taking inspiration from large language models, TrajGPT poses the problem of controlled trajectory generation as that of text infilling in natural language. TrajGPT integrates the spatial and temporal models in a transformer architecture through a Bayesian probability model that ensures that the gaps in a visit sequence are filled in a spatiotemporally consistent manner. Our experiments on public and private datasets demonstrate that TrajGPT not only excels in controlled synthetic visit generation but also outperforms competing models in next-location prediction tasks - Relatively, TrajGPT achieves a 26-fold improvement in temporal accuracy while retaining more than 98% of spatial accuracy on average.

Title: Unlearning in- vs. out-of-distribution data in LLMs under gradient-based method

Authors: Teodora Baluta, Pascal Lamblin, Daniel Tarlow, Fabian Pedregosa, Gintare Karolina Dziugaite
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.04388
Pdf URL: https://arxiv.org/pdf/2411.04388
Copy Paste: [[2411.04388]] Unlearning in- vs. out-of-distribution data in LLMs under gradient-based method(https://arxiv.org/abs/2411.04388)
Keywords: generative
Abstract: Machine unlearning aims to solve the problem of removing the influence of selected training examples from a learned model. Despite the increasing attention to this problem, it remains an open research question how to evaluate unlearning in large language models (LLMs), and what are the critical properties of the data to be unlearned that affect the quality and efficiency of unlearning. This work formalizes a metric to evaluate unlearning quality in generative models, and uses it to assess the trade-offs between unlearning quality and performance. We demonstrate that unlearning out-of-distribution examples requires more unlearning steps but overall presents a better trade-off overall. For in-distribution examples, however, we observe a rapid decay in performance as unlearning progresses. We further evaluate how example's memorization and difficulty affect unlearning under a classical gradient ascent-based approach.

Title: Bayesian Calibration of Win Rate Estimation with LLM Evaluators

Authors: Yicheng Gao, Gonghan Xu, Zhe Wang, Arman Cohan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04424
Pdf URL: https://arxiv.org/pdf/2411.04424
Copy Paste: [[2411.04424]] Bayesian Calibration of Win Rate Estimation with LLM Evaluators(https://arxiv.org/abs/2411.04424)
Keywords: generative
Abstract: Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of text generations from LLMs. However, applying LLM evaluators naively to compare or judge between different systems can lead to unreliable results due to the intrinsic win rate estimation bias of LLM evaluators. In order to mitigate this problem, we propose two calibration methods, Bayesian Win Rate Sampling (BWRS) and Bayesian Dawid-Skene, both of which leverage Bayesian inference to more accurately infer the true win rate of generative language models. We empirically validate our methods on six datasets covering story generation, summarization, and instruction following tasks. We show that both our methods are effective in improving the accuracy of win rate estimation using LLMs as evaluators, offering a promising direction for reliable automatic text quality evaluation.

Title: Scaling Laws for Pre-training Agents and World Models

Authors: Tim Pearce, Tabish Rashid, Dave Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04434
Pdf URL: https://arxiv.org/pdf/2411.04434
Copy Paste: [[2411.04434]] Scaling Laws for Pre-training Agents and World Models(https://arxiv.org/abs/2411.04434)
Keywords: generative
Abstract: The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent's behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that `bigger is better', we show that the same types of power laws found in language modeling (e.g. between loss and optimal model size), also arise in world modeling and imitation learning. However, the coefficients of these laws are heavily influenced by the tokenizer, task \& architecture -- this has important implications on the optimal sizing of models and data.

Title: Comparing Fairness of Generative Mobility Models

Authors: Daniel Wang, Jack McFarland, Afra Mashhadi, Ekin Ugurel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.04453
Pdf URL: https://arxiv.org/pdf/2411.04453
Copy Paste: [[2411.04453]] Comparing Fairness of Generative Mobility Models(https://arxiv.org/abs/2411.04453)
Keywords: generative
Abstract: This work examines the fairness of generative mobility models, addressing the often overlooked dimension of equity in model performance across geographic regions. Predictive models built on crowd flow data are instrumental in understanding urban structures and movement patterns; however, they risk embedding biases, particularly in spatiotemporal contexts where model performance may reflect and reinforce existing inequities tied to geographic distribution. We propose a novel framework for assessing fairness by measuring the utility and equity of generated traces. Utility is assessed via the Common Part of Commuters (CPC), a similarity metric comparing generated and real mobility flows, while fairness is evaluated using demographic parity. By reformulating demographic parity to reflect the difference in CPC distribution between two groups, our analysis reveals disparities in how various models encode biases present in the underlying data. We utilized four models (Gravity, Radiation, Deep Gravity, and Non-linear Gravity) and our results indicate that traditional gravity and radiation models produce fairer outcomes, although Deep Gravity achieves higher CPC. This disparity underscores a trade-off between model accuracy and equity, with the feature-rich Deep Gravity model amplifying pre-existing biases in community representations. Our findings emphasize the importance of integrating fairness metrics in mobility modeling to avoid perpetuating inequities.

Title: Series-to-Series Diffusion Bridge Model

Authors: Hao Yang, Zhanbo Feng, Feng Zhou, Robert C Qiu, Zenan Ling
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04491
Pdf URL: https://arxiv.org/pdf/2411.04491
Copy Paste: [[2411.04491]] Series-to-Series Diffusion Bridge Model(https://arxiv.org/abs/2411.04491)
Keywords: diffusion
Abstract: Diffusion models have risen to prominence in time series forecasting, showcasing their robust capability to model complex data distributions. However, their effectiveness in deterministic predictions is often constrained by instability arising from their inherent stochasticity. In this paper, we revisit time series diffusion models and present a comprehensive framework that encompasses most existing diffusion-based methods. Building on this theoretical foundation, we propose a novel diffusion-based time series forecasting model, the Series-to-Series Diffusion Bridge Model ($\mathrm{S^2DBM}$), which leverages the Brownian Bridge process to reduce randomness in reverse estimations and improves accuracy by incorporating informative priors and conditions derived from historical time series data. Experimental results demonstrate that $\mathrm{S^2DBM}$ delivers superior performance in point-to-point forecasting and competes effectively with other diffusion-based models in probabilistic forecasting.

Title: Hypercube Policy Regularization Framework for Offline Reinforcement Learning

Authors: Yi Shen, Hanyan Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.04534
Pdf URL: https://arxiv.org/pdf/2411.04534
Copy Paste: [[2411.04534]] Hypercube Policy Regularization Framework for Offline Reinforcement Learning(https://arxiv.org/abs/2411.04534)
Keywords: diffusion
Abstract: Offline reinforcement learning has received extensive attention from scholars because it avoids the interaction between the agent and the environment by learning a policy through a static dataset. However, general reinforcement learning methods cannot get satisfactory results in offline reinforcement learning due to the out-of-distribution state actions that the dataset cannot cover during training. To solve this problem, the policy regularization method that tries to directly clone policies used in static datasets has received numerous studies due to its simplicity and effectiveness. However, policy constraint methods make the agent choose the corresponding actions in the static dataset. This type of constraint is usually over-conservative, which results in suboptimal policies, especially in low-quality static datasets. In this paper, a hypercube policy regularization framework is proposed, this method alleviates the constraints of policy constraint methods by allowing the agent to explore the actions corresponding to similar states in the static dataset, which increases the effectiveness of algorithms in low-quality datasets. It was also theoretically demonstrated that the hypercube policy regularization framework can effectively improve the performance of original algorithms. In addition, the hypercube policy regularization framework is combined with TD3-BC and Diffusion-QL for experiments on D4RL datasets which are called TD3-BC-C and Diffusion-QL-C. The experimental results of the score demonstrate that TD3-BC-C and Diffusion-QL-C perform better than state-of-the-art algorithms like IQL, CQL, TD3-BC and Diffusion-QL in most D4RL environments in approximate time.

Title: Meta-Reasoning Improves Tool Use in Large Language Models

Authors: Lisa Alazraki, Marek Rei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04535
Pdf URL: https://arxiv.org/pdf/2411.04535
Copy Paste: [[2411.04535]] Meta-Reasoning Improves Tool Use in Large Language Models(https://arxiv.org/abs/2411.04535)
Keywords: in-context
Abstract: External tools help large language models (LLMs) succeed at tasks where they would otherwise typically fail. In existing frameworks, LLMs learn tool use either by in-context demonstrations or via full model fine-tuning on annotated data. As these approaches do not easily scale, a recent trend is to abandon them in favor of lightweight, parameter-efficient tuning paradigms. These methods allow quickly alternating between the frozen LLM and its specialised fine-tuned version, by switching on or off a handful of additional custom parameters. Hence, we postulate that the generalization ability of the frozen model can be leveraged to improve tool selection. We present Tool selECTion via meta-reasONing (TECTON), a two-phase system that first reasons over a task using a custom fine-tuned LM head and outputs candidate tools. Then, with the custom head disabled, it meta-reasons (i.e., it reasons over the previous reasoning process) to make a final choice. We show that TECTON results in substantial gains - both in-distribution and out-of-distribution - on a range of math reasoning datasets.

Title: Peri-midFormer: Periodic Pyramid Transformer for Time Series Analysis

Authors: Qiang Wu, Gechang Yao, Zhixi Feng, Shuyuan Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.04554
Pdf URL: https://arxiv.org/pdf/2411.04554
Copy Paste: [[2411.04554]] Peri-midFormer: Periodic Pyramid Transformer for Time Series Analysis(https://arxiv.org/abs/2411.04554)
Keywords: anomaly
Abstract: Time series analysis finds wide applications in fields such as weather forecasting, anomaly detection, and behavior recognition. Previous methods attempted to model temporal variations directly using 1D time series. However, this has been quite challenging due to the discrete nature of data points in time series and the complexity of periodic variation. In terms of periodicity, taking weather and traffic data as an example, there are multi-periodic variations such as yearly, monthly, weekly, and daily, etc. In order to break through the limitations of the previous methods, we decouple the implied complex periodic variations into inclusion and overlap relationships among different level periodic components based on the observation of the multi-periodicity therein and its inclusion relationships. This explicitly represents the naturally occurring pyramid-like properties in time series, where the top level is the original time series and lower levels consist of periodic components with gradually shorter periods, which we call the periodic pyramid. To further extract complex temporal variations, we introduce self-attention mechanism into the periodic pyramid, capturing complex periodic relationships by computing attention between periodic components based on their inclusion, overlap, and adjacency relationships. Our proposed Peri-midFormer demonstrates outstanding performance in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection.

Title: Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning

Authors: Marvin Alles, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04562
Pdf URL: https://arxiv.org/pdf/2411.04562
Copy Paste: [[2411.04562]] Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning(https://arxiv.org/abs/2411.04562)
Keywords: generative
Abstract: In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment. In contrast to the online setting, only using static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these by learning a model of the underlying dynamics of the environment and using it to guide policy search. It is beneficial but, with limited datasets, errors in the model and the issue of value overestimation among out-of-distribution states can worsen performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimation derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP) which learns a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective to always stay within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. Thereby eliminating the need to use additional uncertainty penalties on the Bellman update and significantly decreasing the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmark, and show that C-LAP is competitive to state-of-the-art methods, especially outperforming on datasets with visual observations.

Title: DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

Authors: Yuxuan Duan, Yan Hong, Bo Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Li Niu, Liqing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04571
Pdf URL: https://arxiv.org/pdf/2411.04571
Copy Paste: [[2411.04571]] DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning(https://arxiv.org/abs/2411.04571)
Keywords: diffusion
Abstract: The recent progress in text-to-image models pretrained on large-scale datasets has enabled us to generate various images as long as we provide a text prompt describing what we want. Nevertheless, the availability of these models is still limited when we expect to generate images that fall into a specific domain either hard to describe or just unseen to the models. In this work, we propose DomainGallery, a few-shot domain-driven image generation method which aims at finetuning pretrained Stable Diffusion on few-shot target datasets in an attribute-centric manner. Specifically, DomainGallery features prior attribute erasure, attribute disentanglement, regularization and enhancement. These techniques are tailored to few-shot domain-driven generation in order to solve key issues that previous works have failed to settle. Extensive experiments are given to validate the superior performance of DomainGallery on a variety of domain-driven generation scenarios. Codes are available at this https URL.

Title: Social EgoMesh Estimation

Authors: Luca Scofano, Alessio Sampieri, Edoardo De Matteis, Indro Spinelli, Fabio Galasso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04598
Pdf URL: https://arxiv.org/pdf/2411.04598
Copy Paste: [[2411.04598]] Social EgoMesh Estimation(https://arxiv.org/abs/2411.04598)
Keywords: diffusion
Abstract: Accurately estimating the 3D pose of the camera wearer in egocentric video sequences is crucial to modeling human behavior in virtual and augmented reality applications. The task presents unique challenges due to the limited visibility of the user's body caused by the front-facing camera mounted on their head. Recent research has explored the utilization of the scene and ego-motion, but it has overlooked humans' interactive nature. We propose a novel framework for Social Egocentric Estimation of body MEshes (SEE-ME). Our approach is the first to estimate the wearer's mesh using only a latent probabilistic diffusion model, which we condition on the scene and, for the first time, on the social wearer-interactee interactions. Our in-depth study sheds light on when social interaction matters most for ego-mesh estimation; it quantifies the impact of interpersonal distance and gaze direction. Overall, SEE-ME surpasses the current best technique, reducing the pose estimation error (MPJPE) by 53%. The code is available at this https URL.

Title: Solar potential analysis over Indian cities using high-resolution satellite imagery and DEM

Authors: Jai Singla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04610
Pdf URL: https://arxiv.org/pdf/2411.04610
Copy Paste: [[2411.04610]] Solar potential analysis over Indian cities using high-resolution satellite imagery and DEM(https://arxiv.org/abs/2411.04610)
Keywords: diffusion
Abstract: Most of the research work in the solar potential analysis is performed utilizing aerial imagery, LiDAR data, and satellite imagery. However, in the existing studies using satellite data, parameters such as trees/ vegetation shadow, adjacent higher architectural structures, and eccentric roof structures in urban areas were not considered, and relatively coarser-resolution datasets were used for analysis. In this work, we have implemented a novel approach to estimate rooftop solar potential using inputs of high-resolution satellite imagery (0.5 cm), a digital elevation model (1m), along with ground station radiation data. Solar radiation analysis is performed using the diffusion proportion and transmissivity ratio derived from the ground station data hosted by IMD. It was observed that due to seasonal variations, environmental effects and technical reasons such as solar panel structure etc., there can be a significant loss of electricity generation up to 50%. Based on the results, it is also understood that using 1m DEM and 50cm satellite imagery, more authentic results are produced over the urban areas.

Title: Brain Tumour Removing and Missing Modality Generation using 3D WDM

Authors: André Ferreira, Gijs Luijten, Behrus Puladi, Jens Kleesiek, Victor Alves, Jan Egger
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.04630
Pdf URL: https://arxiv.org/pdf/2411.04630
Copy Paste: [[2411.04630]] Brain Tumour Removing and Missing Modality Generation using 3D WDM(https://arxiv.org/abs/2411.04630)
Keywords: diffusion
Abstract: This paper presents the second-placed solution for task 8 and the participation solution for task 7 of BraTS 2024. The adoption of automated brain analysis algorithms to support clinical practice is increasing. However, many of these algorithms struggle with the presence of brain lesions or the absence of certain MRI modalities. The alterations in the brain's morphology leads to high variability and thus poor performance of predictive models that were trained only on healthy brains. The lack of information that is usually provided by some of the missing MRI modalities also reduces the reliability of the prediction models trained with all modalities. In order to improve the performance of these models, we propose the use of conditional 3D wavelet diffusion models. The wavelet transform enabled full-resolution image training and prediction on a GPU with 48 GB VRAM, without patching or downsampling, preserving all information for prediction. For the inpainting task of BraTS 2024, the use of a large and variable number of healthy masks and the stability and efficiency of the 3D wavelet diffusion model resulted in 0.007, 22.61 and 0.842 in the validation set and 0.07 , 22.8 and 0.91 in the testing set (MSE, PSNR and SSIM respectively). The code for these tasks is available at this https URL.

Title: DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction

Authors: Li Zhao, Zhengmin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04646
Pdf URL: https://arxiv.org/pdf/2411.04646
Copy Paste: [[2411.04646]] DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction(https://arxiv.org/abs/2411.04646)
Keywords: diffusion
Abstract: This paper introduces DanceFusion, a novel framework for reconstructing and generating dance movements synchronized to music, utilizing a Spatio-Temporal Skeleton Diffusion Transformer. The framework adeptly handles incomplete and noisy skeletal data common in short-form dance videos on social media platforms like TikTok. DanceFusion incorporates a hierarchical Transformer-based Variational Autoencoder (VAE) integrated with a diffusion model, significantly enhancing motion realism and accuracy. Our approach introduces sophisticated masking techniques and a unique iterative diffusion process that refines the motion sequences, ensuring high fidelity in both motion generation and synchronization with accompanying audio cues. Comprehensive evaluations demonstrate that DanceFusion surpasses existing methods, providing state-of-the-art performance in generating dynamic, realistic, and stylistically diverse dance motions. Potential applications of this framework extend to content creation, virtual reality, and interactive entertainment, promising substantial advancements in automated dance generation. Visit our project page at this https URL.

Title: From CNN to ConvRNN: Adapting Visualization Techniques for Time-Series Anomaly Detection

Authors: Fabien Poirier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04707
Pdf URL: https://arxiv.org/pdf/2411.04707
Copy Paste: [[2411.04707]] From CNN to ConvRNN: Adapting Visualization Techniques for Time-Series Anomaly Detection(https://arxiv.org/abs/2411.04707)
Keywords: anomaly
Abstract: Nowadays, neural networks are commonly used to solve various problems. Unfortunately, despite their effectiveness, they are often perceived as black boxes capable of providing answers without explaining their decisions, which raises numerous ethical and legal concerns. Fortunately, the field of explainability helps users understand these results. This aspect of machine learning allows users to grasp the decision-making process of a model and verify the relevance of its outcomes. In this article, we focus on the learning process carried out by a ``time distributed`` convRNN, which performs anomaly detection from video data.

Title: TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Authors: Wenhao Wang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04709
Pdf URL: https://arxiv.org/pdf/2411.04709
Copy Paste: [[2411.04709]] TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation(https://arxiv.org/abs/2411.04709)
Keywords: diffusion
Abstract: Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of their trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences with existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The project is publicly available at this https URL.

Title: SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Authors: Shivanshu Shekhar, Shreyas Singh, Tong Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.04712
Pdf URL: https://arxiv.org/pdf/2411.04712
Copy Paste: [[2411.04712]] SEE-DPO: Self Entropy Enhanced Direct Preference Optimization(https://arxiv.org/abs/2411.04712)
Keywords: diffusion, generative
Abstract: Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) according to human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics.

Title: Multi-Reward as Condition for Instruction-based Image Editing

Authors: Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04713
Pdf URL: https://arxiv.org/pdf/2411.04713
Copy Paste: [[2411.04713]] Multi-Reward as Condition for Instruction-based Image Editing(https://arxiv.org/abs/2411.04713)
Keywords: diffusion, generative
Abstract: High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text-to-image generative models (e.g., Stable Diffusion, DALL-E) which are not trained for image editing. Accordingly, these datasets suffer from inaccurate instruction following, poor detail preserving, and generation artifacts. In this paper, we propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality. 1) we first design a quantitative metric system based on best-in-class LVLM (Large Vision Language Model), i.e., GPT-4o in our case, to evaluate the generation quality from 3 perspectives, namely, instruction following, detail preserving, and generation quality. For each perspective, we collected quantitative score in $0\sim 5$ and text descriptive feedback on the specific failure points in ground-truth edited images, resulting in a high-quality editing reward dataset, i.e., RewardEdit20K. 2) We further proposed a novel training framework to seamlessly integrate the metric output, regarded as multi-reward, into editing models to learn from the imperfect training triplets. During training, the reward scores and text descriptions are encoded as embeddings and fed into both the latent space and the U-Net of the editing models as auxiliary conditions. During inference, we set these additional conditions to the highest score with no text description for failure points, to aim at the best generation outcome. Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines, i.e., InsPix2Pix and SmartEdit. The code and dataset will be released.

Title: Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation

Authors: Benito Buchheim, Max Reimann, Jürgen Döllner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04724
Pdf URL: https://arxiv.org/pdf/2411.04724
Copy Paste: [[2411.04724]] Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation(https://arxiv.org/abs/2411.04724)
Keywords: diffusion
Abstract: We present a methodology for conditional control of human shape and pose in pretrained text-to-image diffusion models using a 3D human parametric model (SMPL). Fine-tuning these diffusion models to adhere to new conditions requires large datasets and high-quality annotations, which can be more cost-effectively acquired through synthetic data generation rather than real-world data. However, the domain gap and low scene diversity of synthetic data can compromise the pretrained model's visual fidelity. We propose a domain-adaptation technique that maintains image quality by isolating synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network to adapt the generated images to the input domain. To achieve SMPL control, we fine-tune a ControlNet-based architecture on the synthetic SURREAL dataset of rendered humans and apply our domain adaptation at generation time. Experiments demonstrate that our model achieves greater shape and pose diversity than the 2d pose-based ControlNet, while maintaining the visual fidelity and improving stability, proving its usefulness for downstream tasks such as human animation.

Title: Taming Rectified Flow for Inversion and Editing

Authors: Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04746
Pdf URL: https://arxiv.org/pdf/2411.04746
Copy Paste: [[2411.04746]] Taming Rectified Flow for Inversion and Editing(https://arxiv.org/abs/2411.04746)
Keywords: diffusion, generative
Abstract: Rectified-flow-based diffusion transformers, such as FLUX and OpenSora, have demonstrated exceptional performance in the field of image and video generation. Despite their robust generative capabilities, these models often suffer from inaccurate inversion, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that enhances inversion precision by reducing errors in the process of solving rectified flow ODEs. Specifically, we derive the exact formulation of the rectified flow ODE and perform a high-order Taylor expansion to estimate its nonlinear components, significantly decreasing the approximation error at each timestep. Building upon RF-Solver, we further design RF-Edit, which comprises specialized sub-modules for image and video editing. By sharing self-attention layer features during the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments on text-to-image generation, image & video inversion, and image & video editing demonstrate the robust performance and adaptability of our methods. Code is available at this https URL.

Title: Attention Masks Help Adversarial Attacks to Bypass Safety Detectors

Authors: Yunfan Shi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04772
Pdf URL: https://arxiv.org/pdf/2411.04772
Copy Paste: [[2411.04772]] Attention Masks Help Adversarial Attacks to Bypass Safety Detectors(https://arxiv.org/abs/2411.04772)
Keywords: self-supervised
Abstract: Despite recent research advancements in adversarial attack methods, current approaches against XAI monitors are still discoverable and slower. In this paper, we present an adaptive framework for attention mask generation to enable stealthy, explainable and efficient PGD image classification adversarial attack under XAI monitors. Specifically, we utilize mutation XAI mixture and multitask self-supervised X-UNet for attention mask generation to guide PGD attack. Experiments on MNIST (MLP), CIFAR-10 (AlexNet) have shown that our system can outperform benchmark PGD, Sparsefool and SOTA SINIFGSM in balancing among stealth, efficiency and explainability which is crucial for effectively fooling SOTA defense protected classifiers.

Title: Learn to Solve Vehicle Routing Problems ASAP: A Neural Optimization Approach for Time-Constrained Vehicle Routing Problems with Finite Vehicle Fleet

Authors: Elija Deineko, Carina Kehrt
Subjects: cs.LG, cs.ET
Abstract URL: https://arxiv.org/abs/2411.04777
Pdf URL: https://arxiv.org/pdf/2411.04777
Copy Paste: [[2411.04777]] Learn to Solve Vehicle Routing Problems ASAP: A Neural Optimization Approach for Time-Constrained Vehicle Routing Problems with Finite Vehicle Fleet(https://arxiv.org/abs/2411.04777)
Keywords: generative
Abstract: Finding a feasible and prompt solution to the Vehicle Routing Problem (VRP) is a prerequisite for efficient freight transportation, seamless logistics, and sustainable mobility. Traditional optimization methods reach their limits when confronted with the real-world complexity of VRPs, which involve numerous constraints and objectives. Recently, the ability of generative Artificial Intelligence (AI) to solve combinatorial tasks, known as Neural Combinatorial Optimization (NCO), demonstrated promising results, offering new perspectives. In this study, we propose an NCO approach to solve a time-constrained capacitated VRP with a finite vehicle fleet size. The approach is based on an encoder-decoder architecture, formulated in line with the Policy Optimization with Multiple Optima (POMO) protocol and trained via a Proximal Policy Optimization (PPO) algorithm. We successfully trained the policy with multiple objectives (minimizing the total distance while maximizing vehicle utilization) and evaluated it on medium and large instances, benchmarking it against state-of-the-art heuristics. The method is able to find adequate and cost-efficient solutions, showing both flexibility and robust generalization. Finally, we provide a critical analysis of the solution generated by NCO and discuss the challenges and opportunities of this new branch of intelligent learning algorithms emerging in optimization science, focusing on freight transportation.

Title: End-to-end Inception-Unet based Generative Adversarial Networks for Snow and Rain Removals

Authors: Ibrahim Kajo, Mohamed Kas, Yassine Ruichek
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.04821
Pdf URL: https://arxiv.org/pdf/2411.04821
Copy Paste: [[2411.04821]] End-to-end Inception-Unet based Generative Adversarial Networks for Snow and Rain Removals(https://arxiv.org/abs/2411.04821)
Keywords: generative
Abstract: The superior performance introduced by deep learning approaches in removing atmospheric particles such as snow and rain from a single image; favors their usage over classical ones. However, deep learning-based approaches still suffer from challenges related to the particle appearance characteristics such as size, type, and transparency. Furthermore, due to the unique characteristics of rain and snow particles, single network based deep learning approaches struggle in handling both degradation scenarios simultaneously. In this paper, a global framework that consists of two Generative Adversarial Networks (GANs) is proposed where each handles the removal of each particle individually. The architectures of both desnowing and deraining GANs introduce the integration of a feature extraction phase with the classical U-net generator network which in turn enhances the removal performance in the presence of severe variations in size and appearance. Furthermore, a realistic dataset that contains pairs of snowy images next to their groundtruth images estimated using a low-rank approximation approach; is presented. The experiments show that the proposed desnowing and deraining approaches achieve significant improvements in comparison to the state-of-the-art approaches when tested on both synthetic and realistic datasets.

Title: VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Authors: Ming Cheng, Jiaying Gong, Chenhan Yuan, William A. Ingram, Edward Fox, Hoda Eldardiry
Subjects: cs.CL, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.04825
Pdf URL: https://arxiv.org/pdf/2411.04825
Copy Paste: [[2411.04825]] VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models(https://arxiv.org/abs/2411.04825)
Keywords: generative
Abstract: Existing text simplification or paraphrase datasets mainly focus on sentence-level text generation in a general domain. These datasets are typically developed without using domain knowledge. In this paper, we release a novel dataset, VTechAGP, which is the first academic-to-general-audience text paraphrase dataset consisting of 4,938 document-level these and dissertation academic and general-audience abstract pairs from 8 colleges authored over 25 years. We also propose a novel dynamic soft prompt generative language model, DSPT5. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels to further select the best output candidate. We evaluate DSPT5 and various state-of-the-art large language models (LLMs) from multiple perspectives. Results demonstrate that the SOTA LLMs does not provide satisfactory outcomes, while the lightweight DSPT5 can achieve competitive results. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for academic-to-general-audience text paraphrase dataset.

Title: D$^3$epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes

Authors: Siyu Chen, Hong Liu, Wenhao Li, Ying Zhu, Guoquan Wang, Jianbing Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.04826
Pdf URL: https://arxiv.org/pdf/2411.04826
Copy Paste: [[2411.04826]] D$^3$epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes(https://arxiv.org/abs/2411.04826)
Keywords: self-supervised
Abstract: Depth estimation is a crucial technology in robotics. Recently, self-supervised depth estimation methods have demonstrated great potential as they can efficiently leverage large amounts of unlabelled real-world data. However, most existing methods are designed under the assumption of static scenes, which hinders their adaptability in dynamic environments. To address this issue, we present D$^3$epth, a novel method for self-supervised depth estimation in dynamic scenes. It tackles the challenge of dynamic objects from two key perspectives. First, within the self-supervised framework, we design a reprojection constraint to identify regions likely to contain dynamic objects, allowing the construction of a dynamic mask that mitigates their impact at the loss level. Second, for multi-frame depth estimation, we introduce a cost volume auto-masking strategy that leverages adjacent frames to identify regions associated with dynamic objects and generate corresponding masks. This provides guidance for subsequent processes. Furthermore, we propose a spectral entropy uncertainty module that incorporates spectral entropy to guide uncertainty estimation during depth fusion, effectively addressing issues arising from cost volume computation in dynamic environments. Extensive experiments on KITTI and Cityscapes datasets demonstrate that the proposed method consistently outperforms existing self-supervised monocular depth estimation baselines. Code is available at \url{this https URL}.

Title: OneProt: Towards Multi-Modal Protein Foundation Models

Authors: Klemens Flöge, Srisruthi Udayakumar, Johanna Sommer, Marie Piraud, Stefan Kesselheim, Vincent Fortuin, Stephan Günneman, Karel J van der Weg, Holger Gohlke, Alina Bazarova, Erinc Merdivan
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.04863
Pdf URL: https://arxiv.org/pdf/2411.04863
Copy Paste: [[2411.04863]] OneProt: Towards Multi-Modal Protein Foundation Models(https://arxiv.org/abs/2411.04863)
Keywords: foundation model
Abstract: Recent AI advances have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of modality encoders along protein sequences. It demonstrates strong performance in retrieval tasks and surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction. This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.

Title: Boosting Latent Diffusion with Perceptual Objectives

Authors: Tariq Berrada, Pietro Astolfi, Jakob Verbeek, Melissa Hall, Marton Havasi, Michal Drozdzal, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04873
Pdf URL: https://arxiv.org/pdf/2411.04873
Copy Paste: [[2411.04873]] Boosting Latent Diffusion with Perceptual Objectives(https://arxiv.org/abs/2411.04873)
Keywords: diffusion, generative
Abstract: Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative -- with boosts between 6% and 20% in FID -- and qualitative results when using our perceptual loss.

Title: In the Era of Prompt Learning with Vision-Language Models

Authors: Ankit Jha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04892
Pdf URL: https://arxiv.org/pdf/2411.04892
Copy Paste: [[2411.04892]] In the Era of Prompt Learning with Vision-Language Models(https://arxiv.org/abs/2411.04892)
Keywords: foundation model
Abstract: Large-scale foundation models like CLIP have shown strong zero-shot generalization but struggle with domain shifts, limiting their adaptability. In our work, we introduce \textsc{StyLIP}, a novel domain-agnostic prompt learning strategy for Domain Generalization (DG). StyLIP disentangles visual style and content in CLIP`s vision encoder by using style projectors to learn domain-specific prompt tokens and combining them with content features. Trained contrastively, this approach enables seamless adaptation across domains, outperforming state-of-the-art methods on multiple DG benchmarks. Additionally, we propose AD-CLIP for unsupervised domain adaptation (DA), leveraging CLIP`s frozen vision backbone to learn domain-invariant prompts through image style and content features. By aligning domains in embedding space with entropy minimization, AD-CLIP effectively handles domain shifts, even when only target domain samples are available. Lastly, we outline future work on class discovery using prompt learning for semantic segmentation in remote sensing, focusing on identifying novel or rare classes in unstructured environments. This paves the way for more adaptive and generalizable models in complex, real-world scenarios.

Title: GASE: Generatively Augmented Sentence Encoding

Authors: Manuel Frank, Haithem Afli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.04914
Pdf URL: https://arxiv.org/pdf/2411.04914
Copy Paste: [[2411.04914]] GASE: Generatively Augmented Sentence Encoding(https://arxiv.org/abs/2411.04914)
Keywords: generative
Abstract: We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time. Unlike conventional data augmentation that utilises synthetic training data, our approach does not require access to model parameters or the computational resources typically required for fine-tuning state-of-the-art models. Generatively Augmented Sentence Encoding uses diverse linguistic synthetic variants of input texts generated by paraphrasing, summarising, or extracting keywords, followed by pooling the original and synthetic embeddings. Experimental results on the Massive Text Embedding Benchmark for Semantic Textual Similarity (STS) demonstrate performance improvements across a range of embedding models using different generative models for augmentation. We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance. These findings suggest that integrating generative augmentation at inference time adds semantic diversity and can enhance the robustness and generalizability of sentence embeddings for embedding models. Our results show that the degree to which generative augmentation can improve STS performance depends not only on the embedding model but also on the dataset. From a broader perspective, the approach allows trading training for inference compute.

Title: MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Authors: Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04924
Pdf URL: https://arxiv.org/pdf/2411.04924
Copy Paste: [[2411.04924]] MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views(https://arxiv.org/abs/2411.04924)
Keywords: diffusion
Abstract: We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360's performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model. The video results are available on our project page: this https URL.

Title: DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

Authors: Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2411.04928
Pdf URL: https://arxiv.org/pdf/2411.04928
Copy Paste: [[2411.04928]] DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion(https://arxiv.org/abs/2411.04928)
Keywords: diffusion
Abstract: In this paper, we introduce \textbf{DimensionX}, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.

Title: CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Authors: Jingwei Xu, Chenyu Wang, Zibo Zhao, Wen Liu, Yi Ma, Shenghua Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04954
Pdf URL: https://arxiv.org/pdf/2411.04954
Copy Paste: [[2411.04954]] CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM(https://arxiv.org/abs/2411.04954)
Keywords: generative
Abstract: This paper aims to design a unified Computer-Aided Design (CAD) generation system that can easily generate CAD models based on the user's inputs in the form of textual description, images, point clouds, or even a combination of them. Towards this goal, we introduce the CAD-MLLM, the first system capable of generating parametric CAD models conditioned on the multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multi-modalities data and CAD models' vectorized representations. To facilitate the model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains textual description, multi-view images, points, and command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noises and missing points. The project page and more visualizations can be found at: this https URL

Title: Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification

Authors: Mischa Dombrowski, Hadrien Reynaud, Bernhard Kainz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.04956
Pdf URL: https://arxiv.org/pdf/2411.04956
Copy Paste: [[2411.04956]] Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification(https://arxiv.org/abs/2411.04956)
Keywords: diffusion, generative
Abstract: Latent Video Diffusion Models can easily deceive casual observers and domain experts alike thanks to the produced image quality and temporal consistency. Beyond entertainment, this creates opportunities around safe data sharing of fully synthetic datasets, which are crucial in healthcare, as well as other domains relying on sensitive personal information. However, privacy concerns with this approach have not fully been addressed yet, and models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data. This discrepancy may be partly due to the sampling space being a subspace of the training videos, effectively reducing the training data size for downstream models. Additionally, the reduced temporal consistency when generating long videos could be a contributing factor. In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and generalize better. Furthermore, to investigate downstream degradation factors, we propose to use a re-identification model, previously employed as a privacy preservation filter. We demonstrate that it is sufficient to train this model on the latent space of the video generator. Subsequently, we use these models to evaluate the subspace covered by synthetic video datasets and thus introduce a new way to measure the faithfulness of generative machine learning models. We focus on a specific application in healthcare echocardiography to illustrate the effectiveness of our novel methods. Our findings indicate that only up to 30.8% of the training videos are learned in latent video diffusion models, which could explain the lack of performance when training downstream tasks on synthetic data.

Title: VAIR: Visuo-Acoustic Implicit Representations for Low-Cost, Multi-Modal Transparent Surface Reconstruction in Indoor Scenes

Authors: Advaith V. Sethuraman, Onur Bagoren, Harikrishnan Seetharaman, Dalton Richardson, Joseph Taylor, Katherine A. Skinner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.04963
Pdf URL: https://arxiv.org/pdf/2411.04963
Copy Paste: [[2411.04963]] VAIR: Visuo-Acoustic Implicit Representations for Low-Cost, Multi-Modal Transparent Surface Reconstruction in Indoor Scenes(https://arxiv.org/abs/2411.04963)
Keywords: generative
Abstract: Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural representations to enable dense reconstruction of transparent surfaces in indoor scenes. We propose a novel model that leverages generative latent optimization to learn an implicit representation of indoor scenes consisting of transparent surfaces. We demonstrate that we can query the implicit representation to enable volumetric rendering in image space or 3D geometry reconstruction (point clouds or mesh) with transparent surface prediction. We evaluate our method's effectiveness qualitatively and quantitatively on a new dataset collected using a custom, low-cost sensing platform featuring RGB-D cameras and ultrasonic sensors. Our method exhibits significant improvement over state-of-the-art for transparent surface reconstruction.

Title: SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Authors: Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.04989
Pdf URL: https://arxiv.org/pdf/2411.04989
Copy Paste: [[2411.04989]] SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation(https://arxiv.org/abs/2411.04989)
Keywords: diffusion
Abstract: Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided$\unicode{x2013}$offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while being competitive with supervised models in terms of visual quality and motion fidelity.

Title: Clustering in Causal Attention Masking

Authors: Nikita Karagodin, Yury Polyanskiy, Philippe Rigollet
Subjects: cs.LG, cs.AI, math.AP, math.DS
Abstract URL: https://arxiv.org/abs/2411.04990
Pdf URL: https://arxiv.org/pdf/2411.04990
Copy Paste: [[2411.04990]] Clustering in Causal Attention Masking(https://arxiv.org/abs/2411.04990)
Keywords: generative
Abstract: This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (arXiv:2312.10794) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. (arXiv:2312.10794) in this context: While previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical Rényi parking problem from combinatorial geometry to make initial theoretical steps towards demonstrating the existence of meta-stable states.

Title: Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.04996
Pdf URL: https://arxiv.org/pdf/2411.04996
Copy Paste: [[2411.04996]] Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models(https://arxiv.org/abs/2411.04996)
Keywords: foundation model
Abstract: The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8\% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2\% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2\% of the wall-clock time and text quality in 75.6\% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).

Title: ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Authors: David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.05003
Pdf URL: https://arxiv.org/pdf/2411.05003
Copy Paste: [[2411.05003]] ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning(https://arxiv.org/abs/2411.05003)
Keywords: diffusion, generative
Abstract: Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.

Title: Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Authors: Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2411.05005
Pdf URL: https://arxiv.org/pdf/2411.05005
Copy Paste: [[2411.05005]] Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models(https://arxiv.org/abs/2411.05005)
Keywords: diffusion
Abstract: Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.

Title: ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing

Authors: Jun-Kun Chen, Yu-Xiong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.05006
Pdf URL: https://arxiv.org/pdf/2411.05006
Copy Paste: [[2411.05006]] ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing(https://arxiv.org/abs/2411.05006)
Keywords: diffusion
Abstract: This paper proposes ProEdit - a simple yet effective framework for high-quality 3D scene editing guided by diffusion distillation in a novel progressive manner. Inspired by the crucial observation that multi-view inconsistency in scene editing is rooted in the diffusion model's large feasible output space (FOS), our framework controls the size of FOS and reduces inconsistency by decomposing the overall editing task into several subtasks, which are then executed progressively on the scene. Within this framework, we design a difficulty-aware subtask decomposition scheduler and an adaptive 3D Gaussian splatting (3DGS) training strategy, ensuring high quality and efficiency in performing each subtask. Extensive evaluation shows that our ProEdit achieves state-of-the-art results in various scenes and challenging editing tasks, all through a simple framework without any expensive or sophisticated add-ons like distillation losses, components, or training procedures. Notably, ProEdit also provides a new way to control, preview, and select the "aggressivity" of editing operation during the editing process.

Title: SVDQunat: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.05007
Pdf URL: https://arxiv.org/pdf/2411.05007
Copy Paste: [[2411.05007]] SVDQunat: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models(https://arxiv.org/abs/2411.05007)
Keywords: diffusion
Abstract: Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, na\"ıvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-$\Sigma$, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.