2024-12-06

Title: HunyuanVideo: A Systematic Framework For Large Video Generative Models

Authors: Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Aladdin Wang, Andong Wang, Bai Jiawang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Junkun Yuan, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yanxin Long, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Daquan Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Caesar Zhong (Refer to the report for detailed contributions)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03603
Pdf URL: https://arxiv.org/pdf/2412.03603
Copy Paste: [[2412.03603]] HunyuanVideo: A Systematic Framework For Large Video Generative Models(https://arxiv.org/abs/2412.03603)
Keywords: foundation model, generative
Abstract: Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.

Title: MV-Adapter: Multi-view Consistent Image Generation Made Easy

Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03632
Pdf URL: https://arxiv.org/pdf/2412.03632
Copy Paste: [[2412.03632]] MV-Adapter: Multi-view Consistent Image Generation Made Easy(https://arxiv.org/abs/2412.03632)
Keywords: diffusion
Abstract: Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.

Title: Multi-view Image Diffusion via Coordinate Noise and Fourier Attention

Authors: Justin Theiss, Norman Müller, Daeil Kim, Aayush Prakash
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03756
Pdf URL: https://arxiv.org/pdf/2412.03756
Copy Paste: [[2412.03756]] Multi-view Image Diffusion via Coordinate Noise and Fourier Attention(https://arxiv.org/abs/2412.03756)
Keywords: diffusion
Abstract: Recently, text-to-image generation with diffusion models has made significant advancements in both higher fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts still remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism as well as a novel noise initialization technique and cross-attention loss. This Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.

Title: The broader spectrum of in-context learning

Authors: Andrew Kyle Lampinen, Stephanie C. Y. Chan, Aaditya K. Singh, Murray Shanahan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03782
Pdf URL: https://arxiv.org/pdf/2412.03782
Copy Paste: [[2412.03782]] The broader spectrum of in-context learning(https://arxiv.org/abs/2412.03782)
Keywords: in-context
Abstract: The ability of language models to learn a task from a few examples in context has generated substantial interest. Here, we provide a perspective that situates this type of supervised few-shot learning within a much broader spectrum of meta-learned in-context learning. Indeed, we suggest that any distribution of sequences in which context non-trivially decreases loss on subsequent predictions can be interpreted as eliciting a kind of in-context learning. We suggest that this perspective helps to unify the broad set of in-context abilities that language models exhibit, such as adapting to tasks from instructions or role play, or extrapolating time series. This perspective also sheds light on potential roots of in-context learning in lower-level processing of linguistic dependencies (e.g. coreference or parallel structures). Finally, taking this perspective highlights the importance of generalization, which we suggest can be studied along several dimensions: not only the ability to learn something novel, but also flexibility in learning from different presentations, and in applying what is learned. We discuss broader connections to past literature in meta-learning and goal-conditioned agents, and other perspectives on learning and adaptation. We close by suggesting that research on in-context learning should consider this broader spectrum of in-context capabilities and types of generalization.

Title: Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review

Authors: Sofiane Ennadir, Gabriela Zarzar Gandler, Filip Cornell, Lele Cao, Oleg Smirnov, Tianze Wang, Levente Zólyomi, Björn Brinne, Sahar Asadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03783
Pdf URL: https://arxiv.org/pdf/2412.03783
Copy Paste: [[2412.03783]] Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review(https://arxiv.org/abs/2412.03783)
Keywords: self-supervised
Abstract: Graphs are ubiquitous in real-world applications, ranging from social networks to biological systems, and have inspired the development of Graph Neural Networks (GNNs) for learning expressive representations. While most research has centered on static graphs, many real-world scenarios involve dynamic, temporally evolving graphs, motivating the need for Continuous-Time Dynamic Graph (CTDG) models. This paper provides a comprehensive review of Graph Representation Learning (GRL) on CTDGs with a focus on Self-Supervised Representation Learning (SSRL). We introduce a novel theoretical framework that analyzes the expressivity of CTDG models through an Information-Flow (IF) lens, quantifying their ability to propagate and encode temporal and structural information. Leveraging this framework, we categorize existing CTDG methods based on their suitability for different graph types and application scenarios. Within the same scope, we examine the design of SSRL methods tailored to CTDGs, such as predictive and contrastive approaches, highlighting their potential to mitigate the reliance on labeled data. Empirical evaluations on synthetic and real-world datasets validate our theoretical insights, demonstrating the strengths and limitations of various methods across long-range, bi-partite and community-based graphs. This work offers both a theoretical foundation and practical guidance for selecting and developing CTDG models, advancing the understanding of GRL in dynamic settings.

Title: Coordinate In and Value Out: Training Flow Transformers in Ambient Space

Authors: Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03791
Pdf URL: https://arxiv.org/pdf/2412.03791
Copy Paste: [[2412.03791]] Coordinate In and Value Out: Training Flow Transformers in Ambient Space(https://arxiv.org/abs/2412.03791)
Keywords: generative
Abstract: Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on unstructured data like 3D point clouds. These models are commonly trained in two stages: first, a data compressor (i.e., a variational auto-encoder) is trained, and in a subsequent training stage a flow matching generative model is trained in the low-dimensional latent space of the data compressor. This two stage paradigm adds complexity to the overall training recipe and sets obstacles for unifying models across data domains, as specific data compressors are used for different data modalities. To this end, we introduce Ambient Space Flow Transformers (ASFT), a domain-agnostic approach to learn flow matching transformers in ambient space, sidestepping the requirement of training compressors and simplifying the training process. We introduce a conditionally independent point-wise training objective that enables ASFT to make predictions continuously in coordinate space. Our empirical results demonstrate that using general purpose transformer blocks, ASFT effectively handles different data modalities such as images and 3D point clouds, achieving strong performance in both domains and outperforming comparable approaches. ASFT is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

Title: EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

Authors: Quang Nguyen, Truong Vu, Trong-Tung Nguyen, Yuxin Wen, Preston K Robinette, Taylor T Johnson, Tom Goldstein, Anh Tran, Khoi Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03809
Pdf URL: https://arxiv.org/pdf/2412.03809
Copy Paste: [[2412.03809]] EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM(https://arxiv.org/abs/2412.03809)
Keywords: diffusion
Abstract: Image editing technologies are tools used to transform, adjust, remove, or otherwise alter images. Recent research has significantly improved the capabilities of image editing tools, enabling the creation of photorealistic and semantically informed forged regions that are nearly indistinguishable from authentic imagery, presenting new challenges in digital forensics and media credibility. While current image forensic techniques are adept at localizing forged regions produced by traditional image manipulation methods, they struggle to localize regions created by diffusion-based techniques. To bridge this gap, we present a novel framework that integrates a multimodal Large Language Model (LLM) for enhanced reasoning capabilities to localize tampered regions in images produced by diffusion model-based editing methods. By leveraging the contextual and semantic strengths of LLMs, our framework achieves promising results on the MagicBrush, AutoSplice, and PerfBrush (a novel diffusion-based dataset) datasets, outperforming previous approaches in mIoU and F1-score metrics. Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits. Here, where traditional methods typically falter, achieving markedly low scores, our approach demonstrates promising performance.

Title: Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting

Authors: Guangben Lu, Yuzhen Du, Zhimin Sun, Ran Yi, Yifan Qi, Yizhe Tang, Tianyi Wang, Lizhuang Ma, Fangyuan Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03812
Pdf URL: https://arxiv.org/pdf/2412.03812
Copy Paste: [[2412.03812]] Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting(https://arxiv.org/abs/2412.03812)
Keywords: diffusion
Abstract: Foreground-conditioned inpainting aims to seamlessly fill the background region of an image by utilizing the provided foreground subject and a text description. While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. Firstly, we design a Self-Consistent Adapter that integrates the foreground subject features into the layout-related self-attention layer, which helps to alleviate conflicts between the text and subject features by ensuring that the model can effectively consider the foreground subject's characteristics while processing the overall image layout. Secondly, we design a Decoupled Image Feature Extraction method that employs distinct architectures to extract semantic and shape features separately, significantly improving subject feature extraction and ensuring high-quality preservation of the subject's shape. Thirdly, to ensure precise utilization of the extracted features and to focus attention on the subject region, we introduce a Shared Positional Embedding Anchor, greatly improving the model's understanding of subject features and boosting training efficiency. Extensive experiments demonstrate that our method achieves superior performance and efficiency in foreground-conditioned inpainting.

Title: CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP

Authors: Zuo Zuo, Jiahao Dong, Yao Wu, Yanyun Qu, Zongze Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03829
Pdf URL: https://arxiv.org/pdf/2412.03829
Copy Paste: [[2412.03829]] CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP(https://arxiv.org/abs/2412.03829)
Keywords: anomaly
Abstract: Industrial anomaly classification (AC) is an indispensable task in industrial manufacturing, which guarantees the quality and safety of various products. To address the scarcity of data in industrial scenarios, many few-shot anomaly detection methods have emerged recently. In this paper, we propose an effective few-shot anomaly classification (FSAC) framework with one-stage training, dubbed CLIP-FSAC++. Specifically, we introduce a cross-modality interaction module named Anomaly Descriptor following the image and text encoders, which enhances the correlation of visual and text embeddings and adapts the representations of CLIP from pre-trained data to target data. In the Anomaly Descriptor, an image-to-text cross-attention module is used to obtain image-specific text embeddings and a text-to-image cross-attention module is used to obtain text-specific visual embeddings. These modality-specific embeddings are then used to enhance the original representations of CLIP for better matching ability. Comprehensive experimental results are provided to evaluate our method in few-normal shot anomaly classification on VisA and MVTEC-AD for 1-, 2-, 4- and 8-shot settings. The source code is available at this https URL.

Title: A large language model-type architecture for high-dimensional molecular potential energy surfaces

Authors: Xiao Zhu, Srinivasan S. Iyengar
Subjects: cs.LG, physics.atm-clus, physics.chem-ph, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2412.03831
Pdf URL: https://arxiv.org/pdf/2412.03831
Copy Paste: [[2412.03831]] A large language model-type architecture for high-dimensional molecular potential energy surfaces(https://arxiv.org/abs/2412.03831)
Keywords: generative
Abstract: Computing high-dimensional potential surfaces for molecular and materials systems is considered to be a great challenge in computational chemistry, with potential impact in a range of areas including fundamental prediction of reaction rates. In this paper we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces, etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 dimensions. Essentially, a family of neural networks that pertain to the graph-based subsystems gets the job done for this 51-dimensional system. We then ask if this same family of lower-dimensional neural networks can be transformed to provide accurate predictions for a 186-dimensional potential surface. We find that our algorithm does provide reasonably accurate results for this larger problem, with sub-kcal/mol accuracy for the higher-dimensional potential surface.

Title: CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Authors: Hui Zhang, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03859
Pdf URL: https://arxiv.org/pdf/2412.03859
Copy Paste: [[2412.03859]] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation(https://arxiv.org/abs/2412.03859)
Keywords: diffusion
Abstract: Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at this https URL.

Title: Training MLPs on Graphs without Supervision

Authors: Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2412.03864
Pdf URL: https://arxiv.org/pdf/2412.03864
Copy Paste: [[2412.03864]] Training MLPs on Graphs without Supervision(https://arxiv.org/abs/2412.03864)
Keywords: self-supervised
Abstract: Graph Neural Networks (GNNs) have demonstrated their effectiveness in various graph learning tasks, yet their reliance on neighborhood aggregation during inference poses challenges for deployment in latency-sensitive applications, such as real-time financial fraud detection. To address this limitation, recent studies have proposed distilling knowledge from teacher GNNs into student Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate inference. However, these approaches often inadequately explore structural information when inferring unseen nodes. To this end, we introduce SimMLP, a Self-supervised framework for learning MLPs on graphs, designed to fully integrate rich structural information into MLPs. Notably, SimMLP is the first MLP-learning method that can achieve equivalence to GNNs in the optimal case. The key idea is to employ self-supervised learning to align the representations encoded by graph context-aware GNNs and neighborhood dependency-free MLPs, thereby fully integrating the structural information into MLPs. We provide a comprehensive theoretical analysis, demonstrating the equivalence between SimMLP and GNNs based on mutual information and inductive bias, highlighting SimMLP's advanced structural learning capabilities. Additionally, we conduct extensive experiments on 20 benchmark datasets, covering node classification, link prediction, and graph classification, to showcase SimMLP's superiority over state-of-the-art baselines, particularly in scenarios involving unseen nodes (e.g., inductive and cold-start node classification) where structural insights are crucial. Our codes are available at: this https URL.

Title: Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Authors: Jiangweizhi Peng, Zhiwei Tang, Gaowen Liu, Charles Fleming, Mingyi Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03876
Pdf URL: https://arxiv.org/pdf/2412.03876
Copy Paste: [[2412.03876]] Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization(https://arxiv.org/abs/2412.03876)
Keywords: diffusion
Abstract: Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.

Title: DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism

Authors: Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03878
Pdf URL: https://arxiv.org/pdf/2412.03878
Copy Paste: [[2412.03878]] DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism(https://arxiv.org/abs/2412.03878)
Keywords: diffusion, generative
Abstract: The proliferation of several streaming services in recent years has now made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to the local audience, the support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for a given media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g. skin tone, gender). Our approach is also useful for signer anonymization.

Title: Transferring self-supervised pre-trained models for SHM data anomaly detection with scarce labeled data

Authors: Mingyuan Zhou, Xudong Jian, Ye Xia, Zhilu Lai
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2412.03880
Pdf URL: https://arxiv.org/pdf/2412.03880
Copy Paste: [[2412.03880]] Transferring self-supervised pre-trained models for SHM data anomaly detection with scarce labeled data(https://arxiv.org/abs/2412.03880)
Keywords: self-supervised, anomaly
Abstract: Structural health monitoring (SHM) has experienced significant advancements in recent decades, accumulating massive monitoring data. Data anomalies inevitably exist in monitoring data, posing significant challenges to their effective utilization. Recently, deep learning has emerged as an efficient and effective approach for anomaly detection in bridge SHM. Despite its progress, many deep learning models require large amounts of labeled data for training. The process of labeling data, however, is labor-intensive, time-consuming, and often impractical for large-scale SHM datasets. To address these challenges, this work explores the use of self-supervised learning (SSL), an emerging paradigm that combines unsupervised pre-training and supervised fine-tuning. The SSL-based framework aims to learn from only a very small quantity of labeled data by fine-tuning, while making the best use of the vast amount of unlabeled SHM data by pre-training. Mainstream SSL methods are compared and validated on the SHM data of two in-service bridges. Comparative analysis demonstrates that SSL techniques boost data anomaly detection performance, achieving increased F1 scores compared to conventional supervised training, especially given a very limited amount of labeled data. This work demonstrates the effectiveness and superiority of SSL techniques on large-scale SHM data, providing an efficient tool for preliminary anomaly detection with scarce label information.

Title: A Noise is Worth Diffusion Guidance

Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03895
Pdf URL: https://arxiv.org/pdf/2412.03895
Copy Paste: [[2412.03895]] A Noise is Worth Diffusion Guidance(https://arxiv.org/abs/2412.03895)
Keywords: diffusion
Abstract: Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to 'guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: this https URL.

Title: ONER: Online Experience Replay for Incremental Anomaly Detection

Authors: Yizhou Jin, Jiahui Zhu, Guodong Wang, Shiwei Li, Jinjin Zhang, Qingjie Liu, Xinyue Liu, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03907
Pdf URL: https://arxiv.org/pdf/2412.03907
Copy Paste: [[2412.03907]] ONER: Online Experience Replay for Incremental Anomaly Detection(https://arxiv.org/abs/2412.03907)
Keywords: anomaly
Abstract: Incremental anomaly detection sequentially recognizes abnormal regions in novel categories for dynamic industrial scenarios. This remains highly challenging due to knowledge overwriting and feature conflicts, leading to catastrophic forgetting. In this work, we propose ONER, an end-to-end ONline Experience Replay method, which efficiently mitigates catastrophic forgetting while adapting to new tasks with minimal cost. Specifically, our framework utilizes two types of experiences from past tasks: decomposed prompts and semantic prototypes, addressing both model parameter updates and feature optimization. The decomposed prompts consist of learnable components that assemble to produce attention-conditioned prompts. These prompts reuse previously learned knowledge, enabling the model to learn novel tasks effectively. The semantic prototypes operate at both pixel and image levels, performing regularization in the latent feature space to prevent forgetting across various tasks. Extensive experiments demonstrate that our method achieves state-of-the-art performance in incremental anomaly detection with significantly reduced forgetting, while efficiently adapting to new categories with minimal cost. These results confirm the efficiency and stability of ONER, making it a powerful solution for real-world applications.

Title: Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications

Authors: Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew, Hui Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03924
Pdf URL: https://arxiv.org/pdf/2412.03924
Copy Paste: [[2412.03924]] Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications(https://arxiv.org/abs/2412.03924)
Keywords: generative
Abstract: With the rapid advancement of artificial intelligence and deep learning, medical image analysis has become a critical tool in modern healthcare, significantly improving diagnostic accuracy and efficiency. However, AI-based methods also raise serious privacy concerns, as medical images often contain highly sensitive patient information. This review offers a comprehensive overview of privacy-preserving techniques in medical image analysis, including encryption, differential privacy, homomorphic encryption, federated learning, and generative adversarial networks. We explore the application of these techniques across various medical image analysis tasks, such as diagnosis, pathology, and telemedicine. Notably, we organize the review based on specific challenges and their corresponding solutions in different medical image analysis applications, so that technical applications are directly aligned with practical issues, addressing gaps in the current research landscape. Additionally, we discuss emerging trends, such as zero-knowledge proofs and secure multi-party computation, offering insights for future research. This review serves as a valuable resource for researchers and practitioners and can help advance privacy preservation in medical image analysis.

Title: InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Authors: Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, Jiahui Huang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.03934
Pdf URL: https://arxiv.org/pdf/2412.03934
Copy Paste: [[2412.03934]] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models(https://arxiv.org/abs/2412.03934)
Keywords: generative
Abstract: We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.

Title: AIpparel: A Large Multimodal Generative Model for Digital Garments

Authors: Kiyohiro Nakayama, Jan Ackermann, Timur Levent Kesdogan, Yang Zheng, Maria Korosteleva, Olga Sorkine-Hornung, Leonidas J. Guibas, Guandao Yang, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03937
Pdf URL: https://arxiv.org/pdf/2412.03937
Copy Paste: [[2412.03937]] AIpparel: A Large Multimodal Generative Model for Digital Garments(https://arxiv.org/abs/2412.03937)
Keywords: generative
Abstract: Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and enables novel multimodal garment generation applications such as interactive garment editing. The project website is at this http URL.

Title: Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization

Authors: Tianyu Chen, Zhendong Wang, Mingyuan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03941
Pdf URL: https://arxiv.org/pdf/2412.03941
Copy Paste: [[2412.03941]] Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization(https://arxiv.org/abs/2412.03941)
Keywords: diffusion
Abstract: Diffusion models have recently demonstrated notable success in solving inverse problems. However, current diffusion model-based solutions typically require a large number of function evaluations (NFEs) to generate high-quality images conditioned on measurements, as they incorporate only limited information at each step. To accelerate the diffusion-based inverse problem-solving process, we introduce Measurements Optimization (MO), a more efficient plug-and-play module for integrating measurement information at each step of the inverse problem-solving process. This method is comprehensively evaluated across eight diverse linear and nonlinear tasks on the FFHQ and ImageNet datasets. By using MO, we establish state-of-the-art (SOTA) performance across multiple tasks, with key advantages: (1) it operates with no more than 100 NFEs, with phase retrieval on ImageNet being the sole exception; (2) it achieves SOTA or near-SOTA results even at low NFE counts; and (3) it can be seamlessly integrated into existing diffusion model-based solutions for inverse problems, such as DPS (Chung et al., 2022) and Red-diff (Mardani et al., 2023). For example, DPS-MO attains a peak signal-to-noise ratio (PSNR) of 28.71 dB on the FFHQ 256 dataset for high dynamic range imaging, setting a new SOTA benchmark with only 100 NFEs, whereas current methods require between 1000 and 4000 NFEs for comparable performance.
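For readers unfamiliar with the PSNR figure quoted in this abstract, the following is a minimal sketch of how peak signal-to-noise ratio is conventionally computed, PSNR = 10 * log10(MAX^2 / MSE). The function name `psnr`, the flat pixel lists, and the [0, 1] intensity range are assumptions made for illustration; this is not the paper's evaluation code.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel intensity (assumed 1.0 here).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [0.0, 0.5, 1.0, 0.25]   # hypothetical ground-truth pixels
    noisy = [0.1, 0.4, 0.9, 0.35]   # hypothetical reconstruction
    print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```

Because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error.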

Title: A Framework For Image Synthesis Using Supervised Contrastive Learning

Authors: Yibin Liu, Jianyu Zhang, Li Zhang, Shijian Li, Gang Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03957
Pdf URL: https://arxiv.org/pdf/2412.03957
Copy Paste: [[2412.03957]] A Framework For Image Synthesis Using Supervised Contrastive Learning(https://arxiv.org/abs/2412.03957)
Keywords: generative
Abstract: Text-to-image (T2I) generation aims at producing realistic images corresponding to text descriptions. Generative Adversarial Networks (GANs) have proven successful in this task. Typical T2I GANs are two-phase methods that first pretrain an inter-modal representation from aligned image-text pairs and then use a GAN to train an image generator on that basis. However, such a representation ignores the inner-modal semantic correspondence, e.g. between images with the same label. The semantic label describes a priori the inherent distribution pattern with underlying cross-image relationships, which supplements the text description for understanding the full characteristics of an image. In this paper, we propose a framework leveraging both inter- and inner-modal correspondence by label-guided supervised contrastive learning. We extend the T2I GANs to two parameter-sharing contrast branches in both pretraining and generation phases. This integration effectively clusters the semantically similar image-text pair representations, thereby fostering the generation of higher-quality images. We demonstrate our framework on four novel T2I GANs using both the single-object dataset CUB and the multi-object dataset COCO, achieving significant improvements in the Inception Score (IS) and Frechet Inception Distance (FID) metrics of image generation evaluation. Notably, on the more complex multi-object COCO, our framework improves FID by 30.1%, 27.3%, 16.2% and 17.1% for AttnGAN, DM-GAN, SSA-GAN and GALIP, respectively. We also validate our superiority by comparing with other label-guided T2I GANs. The results affirm the effectiveness and competitiveness of our approach in advancing the state-of-the-art GAN for T2I generation.

Title: Local Curvature Smoothing with Stein's Identity for Efficient Score Matching

Authors: Genki Osada, Makoto Shing, Takashi Nishide
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.03962
Pdf URL: https://arxiv.org/pdf/2412.03962
Copy Paste: [[2412.03962]] Local Curvature Smoothing with Stein's Identity for Efficient Score Matching(https://arxiv.org/abs/2412.03962)
Keywords: diffusion
Abstract: The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Jacobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score. We propose a novel score matching variant, local curvature smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by applying Stein's identity, enabling regularization effectiveness and efficient computation. We show that LCSS surpasses existing methods in sample generation performance and matches the performance of denoising score matching, widely adopted by most SDMs, in evaluations such as FID, Inception score, and bits per dimension. Furthermore, we show that LCSS enables realistic image generation even at a high resolution of $1024 \times 1024$.
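As an illustration of how a Stein-type identity can stand in for a Jacobian trace, the sketch below uses the Gaussian integration-by-parts identity E[eps . s(x + sigma*eps)] / sigma = E[tr(J_s(x + sigma*eps))] with eps ~ N(0, I). It only demonstrates the identity on a toy linear score; it is not the LCSS objective from the paper, and all names (`linear_score`, `stein_trace_estimate`, the matrix `A`) are hypothetical.

```python
import random


def linear_score(a, x):
    """Toy score s(x) = -A x, whose Jacobian trace is exactly -trace(A)."""
    n = len(x)
    return [-sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


def stein_trace_estimate(score, x, sigma, n_samples, rng):
    """Monte Carlo estimate of E[tr(J_s(x + sigma*eps))] from score evaluations only.

    No derivative of the score is ever computed; the Gaussian perturbation
    and the inner product with eps replace the explicit Jacobian trace.
    """
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + sigma * ei for xi, ei in zip(x, eps)]
        s = score(shifted)
        total += sum(ei * si for ei, si in zip(eps, s)) / sigma
    return total / n_samples


# Hypothetical 2x2 matrix defining the toy score; trace(A) = 3.
A = [[2.0, 0.5], [0.5, 1.0]]

if __name__ == "__main__":
    rng = random.Random(0)
    estimate = stein_trace_estimate(
        lambda x: linear_score(A, x), [0.3, -0.2], 0.5, 100_000, rng
    )
    exact = -(A[0][0] + A[1][1])
    print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

For a linear score the Jacobian is constant, so the smoothed trace equals the exact trace and the estimate can be checked directly; for a nonlinear score the estimator targets the trace averaged over the Gaussian neighbourhood of x.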

Title: IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

Authors: Sejong Yang, Seoung Wug Oh, Yang Zhou, Seon Joo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04000
Pdf URL: https://arxiv.org/pdf/2412.04000
Copy Paste: [[2412.04000]] IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation(https://arxiv.org/abs/2412.04000)
Keywords: diffusion, generative
Abstract: We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found on this https URL.

Title: PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

Authors: Kangan Qian, Xinyu Jiao, Yining Shi, Yunlong Wang, Ziang Luo, Zheng Fu, Kun Jiang, Diange Yang
Subjects: cs.CV, cs.PF, cs.RO
Abstract URL: https://arxiv.org/abs/2412.04020
Pdf URL: https://arxiv.org/pdf/2412.04020
Copy Paste: [[2412.04020]] PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors(https://arxiv.org/abs/2412.04020)
Keywords: generative
Abstract: Reliable perception of spatial and motion information is crucial for safe autonomous navigation. Traditional approaches typically fall into two categories: object-centric and class-agnostic methods. While object-centric methods often struggle with missed detections, leading to inaccuracies in motion prediction, many class-agnostic methods focus heavily on encoder design, often overlooking important priors like rigidity and temporal consistency, leading to suboptimal performance, particularly with sparse LiDAR data in distant regions. To address these issues, we propose PriorMotion, a generative framework that extracts rasterized and vectorized scene representations to model spatio-temporal priors. Our model comprises a BEV encoder, a Raster-Vector prior Encoder, and a Spatio-Temporal prior Generator, improving both spatial and temporal consistency in motion prediction. Additionally, we introduce a standardized evaluation protocol for class-agnostic motion prediction. Experiments on the nuScenes dataset show that PriorMotion achieves state-of-the-art performance, with further validation on advanced FMCW LiDAR confirming its robustness.

Title: Boundary-Guided Learning for Gene Expression Prediction in Spatial Transcriptomics

Authors: Mingcheng Qu, Yuncong Wu, Donglin Di, Anyang Su, Tonghua Su, Yang Song, Lei Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04072
Pdf URL: https://arxiv.org/pdf/2412.04072
Copy Paste: [[2412.04072]] Boundary-Guided Learning for Gene Expression Prediction in Spatial Transcriptomics(https://arxiv.org/abs/2412.04072)
Keywords: in-context
Abstract: Spatial transcriptomics (ST) has emerged as an advanced technology that provides spatial context to gene expression. Recently, deep learning-based methods have shown the capability to predict gene expression from whole-slide image (WSI) data using ST data. Existing approaches typically extract features from images and the neighboring regions using pretrained models, and then develop methods to fuse this information to generate the final output. However, these methods often fail to account for cellular structure similarity, cellular density, and the interactions within the microenvironment. In this paper, we propose a framework named BG-TRIPLEX, which leverages boundary information extracted from pathological images as guiding features to enhance gene expression prediction from WSIs. Specifically, our model consists of three branches: the spot, in-context and global branches. In the spot and in-context branches, boundary information, including edge and nuclei characteristics, is extracted using pretrained models. These boundary features guide the learning of cellular morphology and the characteristics of the microenvironment through Multi-Head Cross-Attention. Finally, these features are integrated with global features to predict the final output. Extensive experiments were conducted on three public ST datasets. The results demonstrate that our BG-TRIPLEX consistently outperforms existing methods in terms of Pearson Correlation Coefficient (PCC). This method highlights the crucial role of boundary features in understanding the complex interactions between WSI and gene expression, offering a promising direction for future research.

Title: SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning

Authors: Seokju Yun, Seunghye Chae, Dongheon Lee, Youngmin Ro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04077
Pdf URL: https://arxiv.org/pdf/2412.04077
Copy Paste: [[2412.04077]] SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning(https://arxiv.org/abs/2412.04077)
Keywords: foundation model
Abstract: Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter-Efficient Fine-Tuning (PEFT) of foundation models has shown promising results in the context of the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving generalizable components of the pre-trained model and learning task-specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Low-Rank Adaptation (SoRA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoRA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Furthermore, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoRA attains state-of-the-art results on multiple benchmarks that span from domain generalized semantic segmentation to domain generalized object detection. In addition, our methods introduce no additional inference overhead or regularization loss, maintain compatibility with any backbone or head, and are designed to be versatile, allowing easy integration into a wide range of tasks.

Title: MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Authors: Haoning Wu, Ziheng Zhao, Ya Zhang, Weidi Xie, Yanfeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04106
Pdf URL: https://arxiv.org/pdf/2412.04106
Copy Paste: [[2412.04106]] MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities(https://arxiv.org/abs/2412.04106)
Keywords: diffusion, generative
Abstract: Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: (i) we collect and curate a large-scale radiology image-text dataset, MedGen-1M, comprising modality labels, attributes, region, and organ information, along with a subset of organ mask annotations, to support research in controllable medical image generation; (ii) we propose a diffusion-based data engine, termed MRGen, which enables generation conditioned on text prompts and masks, synthesizing MR images for diverse modalities lacking mask annotations, to train segmentation models on unannotated modalities; (iii) we conduct extensive experiments across various modalities, illustrating that our data engine can effectively synthesize training samples and extend MRI segmentation towards unannotated modalities.

Title: Compositional Generative Multiphysics and Multi-component Simulation

Authors: Tao Zhang, Zhenhai Liu, Feipeng Qi, Yongjun Jiao, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04134
Pdf URL: https://arxiv.org/pdf/2412.04134
Copy Paste: [[2412.04134]] Compositional Generative Multiphysics and Multi-component Simulation(https://arxiv.org/abs/2412.04134)
Keywords: diffusion, generative
Abstract: Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields like nuclear and aerospace engineering. Previous studies often rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers, each responsible for evolving a specific physical process, into a coupled program, which introduces significant development challenges. Furthermore, no universal algorithm exists for multi-component simulations, which adds to the complexity. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions modeling the conditional probability of one physical process/component conditioned on other processes/components. In inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, achieved by composing the learned energy functions in a structured way. We test our method in three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupling solution using decoupled data, while the surrogate model fails in the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared to the surrogate model.

Title: Understanding Memorization in Generative Models via Sharpness in Probability Landscapes

Authors: Dongjae Jeon, Dueun Kim, Albert No
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04140
Pdf URL: https://arxiv.org/pdf/2412.04140
Copy Paste: [[2412.04140]] Understanding Memorization in Generative Models via Sharpness in Probability Landscapes(https://arxiv.org/abs/2412.04140)
Keywords: diffusion, generative
Abstract: In this paper, we introduce a geometric framework to analyze memorization in diffusion models using the eigenvalues of the Hessian of the log probability density. We propose that memorization arises from isolated points in the learned probability distribution, characterized by sharpness in the probability landscape, as indicated by large negative eigenvalues of the Hessian. Through experiments on various datasets, we demonstrate that these eigenvalues effectively detect and quantify memorization. Our approach provides a clear understanding of memorization in diffusion models and lays the groundwork for developing strategies to ensure secure and reliable generative models.
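To make the sharpness notion concrete, the following toy sketch compares the curvature of the log density for a narrow and a broad one-dimensional Gaussian. In one dimension the Hessian of log p is a single number, -1/std^2, so a tightly concentrated mode (a stand-in for an isolated, memorized point) shows a much larger negative value. The function names and the finite-difference step are assumptions for illustration and are not taken from the paper.

```python
import math


def gaussian_log_density(x, mean, std):
    """Log density of a 1D Gaussian N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def curvature(log_p, x, h=1e-3):
    """Central finite-difference second derivative (the 1D Hessian) of log_p at x."""
    return (log_p(x + h) - 2.0 * log_p(x) + log_p(x - h)) / (h * h)


def sharp_mode(x):
    # Narrow mode: probability mass concentrated around a single point.
    return gaussian_log_density(x, 0.0, 0.1)


def broad_mode(x):
    # Broad mode: probability mass spread over a wide region.
    return gaussian_log_density(x, 0.0, 1.0)


if __name__ == "__main__":
    print("sharp mode curvature:", curvature(sharp_mode, 0.0))
    print("broad mode curvature:", curvature(broad_mode, 0.0))
```

In higher dimensions the same comparison would be made on the eigenvalues of the full Hessian matrix rather than on a single second derivative.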

Title: AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Authors: Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04146
Pdf URL: https://arxiv.org/pdf/2412.04146
Copy Paste: [[2412.04146]] AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models(https://arxiv.org/abs/2412.04146)
Keywords: diffusion
Abstract: Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire, and struggle to preserve the garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-crafted design, AnyDressing can serve as a plug-in module to easily integrate with any community control extensions for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.

Title: Instructional Video Generation

Authors: Yayuan Li, Zhi Cao, Jason J. Corso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04189
Pdf URL: https://arxiv.org/pdf/2412.04189
Copy Paste: [[2412.04189]] Instructional Video Generation(https://arxiv.org/abs/2412.04189)
Keywords: diffusion
Abstract: Despite the recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is the class of egocentric instructional videos in which the intricate motion of the hand coupled with a mostly stable and non-distracting environment is necessary to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method incorporates two distinct innovations. First, we propose an automatic method to generate the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss to guide the diffusion model to focus on smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D, demonstrating significant improvements over state-of-the-art methods in terms of instructional clarity, especially of the hand motion in the target region, across diverse environments and actions. Video results can be found on the project webpage: this https URL.

Title: PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models

Authors: Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, Andrea Nascetti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04204
Pdf URL: https://arxiv.org/pdf/2412.04204
Copy Paste: [[2412.04204]] PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models(https://arxiv.org/abs/2412.04204)
Keywords: foundation model
Abstract: Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks that are too easy or too narrow, limiting the usefulness of the evaluations for assessing the real-world applicability of GFMs. Additionally, there is a distinct lack of diversity in current evaluation protocols, which fail to account for the multiplicity of image resolutions, sensor types, and temporalities, further complicating the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, calling into question the global applicability of GFMs. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and widely applicable benchmark for GFMs. We evaluate the most popular GFMs openly available on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g. UNet and vanilla ViT), and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at this https URL.

Title: VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Authors: Jiahao Zhang, Ryota Yoshihashi, Shunsuke Kitada, Atsuki Osanai, Yuta Nakashima
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04237
Pdf URL: https://arxiv.org/pdf/2412.04237
Copy Paste: [[2412.04237]] VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction(https://arxiv.org/abs/2412.04237)
Keywords: generative
Abstract: Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLMs), which show prominent multi-modal understanding capabilities. How, then, can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate that our method, combined with Gemini and without any additional training, achieves state-of-the-art (SOTA) layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.

Title: LMDM:Latent Molecular Diffusion Model For 3D Molecule Generation

Authors: Xiang Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04242
Pdf URL: https://arxiv.org/pdf/2412.04242
Copy Paste: [[2412.04242]] LMDM:Latent Molecular Diffusion Model For 3D Molecule Generation(https://arxiv.org/abs/2412.04242)
Keywords: diffusion
Abstract: In this work, we propose a latent molecular diffusion model that can make the generated 3D molecules rich in diversity while maintaining rich geometric features. The model captures the information of the forces and local constraints between atoms so that the generated molecules can maintain Euclidean transformation and a high level of effectiveness and diversity. We also use the lower-rank manifold advantage of the latent variables of the latent model to fuse the information of the forces between atoms to better maintain the geometric equivariant properties of the molecules. Because there is no need to perform information fusion encoding in stages like traditional encoders and decoders, this reduces the amount of calculation in the back-propagation process. The model keeps the forces and local constraints of particle bonds in the latent variable space, reducing the impact of underfitting on the surface of the network on the large position drift of the particle geometry, so that our model can converge earlier. We introduce a distribution control variable in each backward step to strengthen exploration and improve the diversity of generation. In the experiment, the quality of the samples we generated and the convergence speed of the model have been significantly improved.

Title: Quantifying the Limits of Segment Anything Model: Analyzing Challenges in Segmenting Tree-Like and Low-Contrast Structures

Authors: Yixin Zhang, Nicholas Konz, Kevin Kramer, Maciej A. Mazurowski
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.04243
Pdf URL: https://arxiv.org/pdf/2412.04243
Copy Paste: [[2412.04243]] Quantifying the Limits of Segment Anything Model: Analyzing Challenges in Segmenting Tree-Like and Low-Contrast Structures(https://arxiv.org/abs/2412.04243)
Keywords: foundation model
Abstract: Segment Anything Model (SAM) has shown impressive performance in interactive and zero-shot segmentation across diverse domains, suggesting that it has learned a general concept of "objects" from its large-scale training. However, we observed that SAM struggles with certain types of objects, particularly those featuring dense, tree-like structures and low textural contrast from their surroundings. These failure modes are critical for understanding its limitations in real-world use. In order to systematically examine this issue, we propose metrics to quantify two key object characteristics: tree-likeness and textural separability. Through extensive controlled synthetic experiments and testing on real datasets, we demonstrate that SAM's performance is noticeably correlated with these factors. We link these behaviors under the concept of "textural confusion", where SAM misinterprets local structure as global texture, leading to over-segmentation, or struggles to differentiate objects from similarly textured backgrounds. These findings offer the first quantitative framework to model SAM's challenges, providing valuable insights into its limitations and guiding future improvements for vision foundation models.

Title: SCADE: Scalable Command-line Anomaly Detection Engine

Authors: Vaishali Vinay, Anjali Mangal
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04259
Pdf URL: https://arxiv.org/pdf/2412.04259
Copy Paste: [[2412.04259]] SCADE: Scalable Command-line Anomaly Detection Engine(https://arxiv.org/abs/2412.04259)
Keywords: anomaly
Abstract: As command-line interfaces remain an integral part of high-computation environments, the risk of exploitation through stealthy, complex command-line abuse continues to grow. Conventional security solutions often struggle with these command-line-based anomalies due to their context-specific nature and lack of labeled data, especially in detecting rare, malicious patterns amidst legitimate, high-volume activity. This gap has left organizations vulnerable to sophisticated threats like Living-off-the-Land (LOL) attacks, where standard detection tools frequently miss or misclassify anomalous command-line behavior. We introduce the Scalable Command-Line Anomaly Detection Engine (SCADE), which addresses these challenges with a dual-layered detection framework that combines a global statistical analysis with local context-specific anomaly detection, using a novel ensemble of statistical models such as BM25 and Log Entropy, adapted for command-line data. The framework also features a dynamic thresholding mechanism for adaptive anomaly detection, ensuring high precision and recall even in environments with extremely high Signal-to-Noise Ratios (SNRs). Initial experimental results demonstrate the effectiveness of the framework, achieving above 98% SNR in identifying unusual command-line behavior while minimizing false positives. In this paper, we present SCADE's core architecture, including its metadata-enriched approach to anomaly detection and the design choices behind its scalability for enterprise-level deployment. We argue that SCADE represents a significant advancement in command-line anomaly detection, offering a robust, adaptive framework for security analysts and researchers seeking to enhance detection accuracy in high-computation environments.
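Note: as a rough illustration of the BM25 component named in the abstract above, the following is a minimal sketch of standard Okapi BM25 scoring over whitespace-tokenized command lines. It is not SCADE's implementation: the tokenization, the parameters k1 and b, the sample commands, and the function name bm25_scores are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against `query` with Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in corpus for tok in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for tok in query:
            if tok not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[tok] + 0.5) / (df[tok] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[tok] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical command lines, tokenized on whitespace (an assumption).
commands = [
    "ls -la /home/user".split(),
    "ls -la /tmp".split(),
    "certutil -urlcache -f http://example.test/a.bin".split(),
]
scores = bm25_scores(commands, ["certutil", "-urlcache"])
```

In this sketch, tokens that appear in few commands receive a larger inverse-document-frequency weight, so only the third command gets a non-zero score for the query.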

Title: SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Authors: Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04262
Pdf URL: https://arxiv.org/pdf/2412.04262
Copy Paste: [[2412.04262]] SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction(https://arxiv.org/abs/2412.04262)
Keywords: generative
Abstract: Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.

Title: SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Authors: Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, Guangliang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04292
Pdf URL: https://arxiv.org/pdf/2412.04292
Copy Paste: [[2412.04292]] SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model(https://arxiv.org/abs/2412.04292)
Keywords: generative
Abstract: The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released.

Title: SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

Authors: Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04301
Pdf URL: https://arxiv.org/pdf/2412.04301
Copy Paste: [[2412.04301]] SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion(https://arxiv.org/abs/2412.04301)
Keywords: diffusion
Abstract: Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing, which is substantially faster than previous multi-step methods (at least 50 times faster) while maintaining competitive performance in editing results. Our project page is at: this https URL

Title: Towards Zero-shot 3D Anomaly Localization

Authors: Yizhou Wang, Kuan-Chuan Peng, Yun Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04304
Pdf URL: https://arxiv.org/pdf/2412.04304
Copy Paste: [[2412.04304]] Towards Zero-shot 3D Anomaly Localization(https://arxiv.org/abs/2412.04304)
Keywords: anomaly
Abstract: 3D anomaly detection and localization is of great significance for industrial inspection. Prior 3D anomaly detection and localization methods focus on the setting in which the testing data share the same category as the normal training data. However, in real-world applications, the normal training data for the target 3D objects can be unavailable due to issues like data privacy or export control regulation. To tackle these challenges, we identify a new task -- zero-shot 3D anomaly detection and localization, where the training and testing classes do not overlap. To this end, we design 3DzAL, a novel patch-level contrastive learning framework based on pseudo anomalies generated using the inductive bias from task-irrelevant 3D xyz data to learn more representative feature representations. Furthermore, we train a normalcy classifier network to classify the normal patches and pseudo anomalies and utilize the classification result jointly with feature distance to design anomaly scores. Instead of directly using the patch point clouds, we introduce adversarial perturbations to the input patch xyz data before feeding it into the 3D normalcy classifier for the classification-based anomaly score. We show that 3DzAL outperforms state-of-the-art anomaly detection and localization methods.

Title: The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04318
Pdf URL: https://arxiv.org/pdf/2412.04318
Copy Paste: [[2412.04318]] The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation(https://arxiv.org/abs/2412.04318)
Keywords: generative
Abstract: This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
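Note: to make the decoding terms in the abstract above concrete, the following is a minimal sketch of greedy decoding, Top-P (nucleus) sampling, and Shannon entropy over a single next-token distribution. The two toy distributions and all function names are assumptions for illustration; nothing here reproduces the paper's models or experiments.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def greedy(probs):
    """Greedy decoding: return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_p_sample(probs, p=0.9, rng=random):
    """Top-P sampling: draw from the smallest set of most probable tokens
    whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Sample within the nucleus, proportionally to the original probabilities.
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r < acc:
            return i
    return nucleus[-1]

flat = [0.4, 0.3, 0.2, 0.1]               # toy "ordinary" prediction
peaked = [0.997, 0.001, 0.001, 0.001]     # toy low-entropy prediction
```

With a distribution as peaked as the second one, the nucleus for p = 0.9 contains a single token, so in this sketch Top-P sampling and greedy decoding select the same token.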

Title: Retrieval-Augmented Machine Translation with Unstructured Knowledge

Authors: Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04342
Pdf URL: https://arxiv.org/pdf/2412.04342
Copy Paste: [[2412.04342]] Retrieval-Augmented Machine Translation with Unstructured Knowledge(https://arxiv.org/abs/2412.04342)
Keywords: in-context
Abstract: Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance models' MT ability. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. Besides, documents from different languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.

Title: RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

Authors: Zhouyingcheng Liao, Mingyuan Zhang, Wenjia Wang, Lei Yang, Taku Komura
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.04343
Pdf URL: https://arxiv.org/pdf/2412.04343
Copy Paste: [[2412.04343]] RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse(https://arxiv.org/abs/2412.04343)
Keywords: diffusion
Abstract: While motion generation has made substantial progress, its practical application remains constrained by dataset diversity and scale, limiting its ability to handle out-of-distribution scenarios. To address this, we propose a simple and effective baseline, RMD, which enhances the generalization of motion generation through retrieval-augmented techniques. Unlike previous retrieval-based methods, RMD requires no additional training and offers three key advantages: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with an LLM facilitating splitting and recombination; and (3) a pre-trained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. Without any training, RMD achieves state-of-the-art performance, with notable advantages on out-of-distribution data.

Title: ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Authors: Dayoung Gong, Suha Kwak, Minsu Cho
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04353
Pdf URL: https://arxiv.org/pdf/2412.04353
Copy Paste: [[2412.04353]] ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation(https://arxiv.org/abs/2412.04353)
Keywords: diffusion
Abstract: Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner; the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves the state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both of the two tasks with a single unified model through joint learning.
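Note: the anticipative masking idea described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the observed_ratio parameter, and the fixed placeholder vector are hypothetical, and the placeholder stands in for the learnable tokens that the abstract mentions.

```python
def anticipative_mask(num_frames, observed_ratio):
    """Return a visibility mask: True for observed frames, False for the
    late part of the video that is hidden as the unseen future."""
    num_visible = int(num_frames * observed_ratio)
    return [t < num_visible for t in range(num_frames)]

def apply_mask(frame_features, mask, mask_token):
    """Replace the features of invisible frames with a placeholder token."""
    return [f if visible else mask_token
            for f, visible in zip(frame_features, mask)]

# Toy per-frame features for a 10-frame video (hypothetical values).
features = [[float(t), float(2 * t)] for t in range(10)]
mask = anticipative_mask(len(features), observed_ratio=0.5)
masked = apply_mask(features, mask, mask_token=[0.0, 0.0])
```

In this sketch the first half of the sequence keeps its features (the part used for segmentation), while the second half carries only placeholder tokens (the part to be anticipated).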

Title: Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting

Authors: Edoardo Cetin, Ahmed Touati, Yann Ollivier
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04368
Pdf URL: https://arxiv.org/pdf/2412.04368
Copy Paste: [[2412.04368]] Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting(https://arxiv.org/abs/2412.04368)
Keywords: foundation model
Abstract: The forward-backward representation (FB) is a recently proposed framework (Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation models (BFMs) that aim at providing zero-shot efficient policies for any new task specified in a given reinforcement learning (RL) environment, without training for each new task. Here we address two core limitations of FB model training. First, FB, like all successor-feature-based methods, relies on a linear encoding of tasks: at test time, each new reward function is linearly projected onto a fixed set of pre-trained features. This limits expressivity as well as precision of the task representation. We break the linearity limitation by introducing auto-regressive features for FB, which let fine-grained task features depend on coarser-grained task information. This can represent arbitrary nonlinear task encodings, thus significantly increasing expressivity of the FB framework. Second, it is well-known that training RL agents from offline datasets often requires specific techniques. We show that FB works well together with such offline RL techniques, by adapting techniques from (Nair et al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining performance in some datasets, such as DMC Humanoid. As a result, we produce efficient FB BFMs for a number of new environments. Notably, in the D4RL locomotion benchmark, the generic FB agent matches the performance of standard single-task offline agents (IQL, XQL). In many setups, the offline techniques are needed to get any decent performance at all. The auto-regressive features have a positive but moderate impact, concentrated on tasks requiring spatial precision and task generalization beyond the behaviors represented in the trainset.
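Note: the "linear encoding of tasks" mentioned in the abstract above can be illustrated with a minimal sketch in which a task vector is the average of reward-weighted state features. The feature vectors, reward samples, and the name task_embedding are hypothetical; this is a toy illustration of a linear projection, not the authors' training or inference code.

```python
def task_embedding(features, rewards):
    """Linear task encoding: z = mean over sampled states of r(s) * B(s),
    where `features` holds one feature vector B(s) per sampled state."""
    dim = len(features[0])
    n = len(features)
    return [sum(r * f[k] for f, r in zip(features, rewards)) / n
            for k in range(dim)]

# Toy pre-trained features for three sampled states (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r1 = [1.0, 0.0, 2.0]   # reward samples for a first task
r2 = [0.0, 3.0, 1.0]   # reward samples for a second task

z1 = task_embedding(features, r1)
z2 = task_embedding(features, r2)
# The encoding of the summed reward equals the sum of the encodings.
z_sum = task_embedding(features, [a + b for a, b in zip(r1, r2)])
```

Because the encoding is linear in the reward, the embedding of a summed reward is the sum of the individual embeddings, which is the kind of restriction the auto-regressive features are described as relaxing.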

Title: Discriminative Fine-tuning of LVLMs

Authors: Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04378
Pdf URL: https://arxiv.org/pdf/2412.04378
Copy Paste: [[2412.04378]] Discriminative Fine-tuning of LVLMs(https://arxiv.org/abs/2412.04378)
Keywords: generative
Abstract: Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

Title: Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Authors: Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04424
Pdf URL: https://arxiv.org/pdf/2412.04424
Copy Paste: [[2412.04424]] Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion(https://arxiv.org/abs/2412.04424)
Keywords: foundation model, generative
Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile to be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. this https URL

Title: Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Authors: Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04431
Pdf URL: https://arxiv.org/pdf/2412.04431
Copy Paste: [[2412.04431]] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis(https://arxiv.org/abs/2412.04431)
Keywords: diffusion
Abstract: We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable of generating high-resolution, photorealistic images following language instruction. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer & classifier and a bitwise self-correction mechanism, remarkably improving the generation capacity and details. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.
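Note: one way to read "bitwise token prediction" in the abstract above is that a code from a vocabulary of size 2^d can be written as d binary digits, so a predictor may output d bits instead of one score per vocabulary entry. The sketch below shows only that generic binary encoding; the function names and the choice of 16 bits are assumptions, and it is not Infinity's tokenizer or classifier.

```python
def index_to_bits(index, num_bits):
    """Encode a token index as a list of bits, most significant bit first."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]

def bits_to_index(bits):
    """Decode a most-significant-first bit list back into a token index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index

num_bits = 16                      # hypothetical code width
vocab_size = 2 ** num_bits         # number of distinct codes at this width
index_outputs = vocab_size         # one score per code for an index classifier
bitwise_outputs = num_bits         # one output per bit for a bitwise predictor

bits = index_to_bits(40000, num_bits)
```

Under this encoding the number of outputs grows linearly with the number of bits while the number of representable codes grows exponentially, which gives some intuition for why very large vocabularies become tractable.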

Title: Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04432
Pdf URL: https://arxiv.org/pdf/2412.04432
Copy Paste: [[2412.04432]] Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation(https://arxiv.org/abs/2412.04432)
Keywords: diffusion, self-supervised
Abstract: In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.

Title: Towards Real-Time Open-Vocabulary Video Instance Segmentation

Authors: Bin Yan, Martin Sundermeyer, David Joseph Tan, Huchuan Lu, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04434
Pdf URL: https://arxiv.org/pdf/2412.04434
Copy Paste: [[2412.04434]] Towards Real-Time Open-Vocabulary Video Instance Segmentation(https://arxiv.org/abs/2412.04434)
Keywords: foundation model
Abstract: In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real-time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS, and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and, (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY-VIS's potential for real-time applications in dynamic environments such as mobile robotics and augmented reality. Code and model will be released at this https URL.

Title: Learning Artistic Signatures: Symmetry Discovery and Style Transfer

Authors: Emma Finn, T. Anderson Keller, Emmanouil Theodosis, Demba E. Ba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04441
Pdf URL: https://arxiv.org/pdf/2412.04441
Copy Paste: [[2412.04441]] Learning Artistic Signatures: Symmetry Discovery and Style Transfer(https://arxiv.org/abs/2412.04441)
Keywords: diffusion
Abstract: Despite nearly a decade of literature on style transfer, there is no undisputed definition of artistic style. State-of-the-art models produce impressive results but are difficult to interpret since, without a coherent definition of style, the problem of style transfer is inherently ill-posed. Early work framed style-transfer as an optimization problem but treated style as a measure only of texture. This led to artifacts in the outputs of early models where content features from the style image sometimes bled into the output image. Conversely, more recent work with diffusion models offers compelling empirical results but provides little theoretical grounding. To address these issues, we propose an alternative definition of artistic style. We suggest that style should be thought of as a set of global symmetries that dictate the arrangement of local textures. We validate this perspective empirically by learning the symmetries of a large dataset of paintings and showing that symmetries are predictive of the artistic movement to which each painting belongs. Finally, we show that by considering both local and global features, using both Lie generators and traditional measures of texture, we can quantitatively capture the stylistic similarity between artists better than with either set of features alone. This approach not only aligns well with art historians' consensus but also offers a robust framework for distinguishing nuanced stylistic differences, allowing for a more interpretable, theoretically grounded approach to style transfer.

Title: DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ping Luo, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04446
Pdf URL: https://arxiv.org/pdf/2412.04446
Copy Paste: [[2412.04446]] DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models(https://arxiv.org/abs/2412.04446)
Keywords: diffusion
Abstract: Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe utilizes deep tokens with a considerable compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained through leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable using readily available AR architectures, and is capable of generating videos ranging from a few seconds to one minute using only 4 A100 GPUs for training. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality while ensuring efficient training. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter sizes and observe a consistent improvement in performance as the model size increases from 100M to 3B. We believe that DiCoDe's exploration in academia represents a promising initial step toward scalable video modeling with AR language models, paving the way for the development of larger and more powerful video generation models.

Title: MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04448
Pdf URL: https://arxiv.org/pdf/2412.04448
Copy Paste: [[2412.04448]] MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation(https://arxiv.org/abs/2412.04448)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.

Title: Four-Plane Factorized Video Autoencoders

Authors: Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04452
Pdf URL: https://arxiv.org/pdf/2412.04452
Copy Paste: [[2412.04452]] Four-Plane Factorized Video Autoencoders(https://arxiv.org/abs/2412.04452)
Keywords: diffusion, generative
Abstract: Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high resolution data into a compressed lower dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains a rich representation needed for high-fidelity reconstructions despite the heavy compression, while simultaneously enabling LDMs to operate with significant improvements in speed and memory.

Title: LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

Authors: Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04460
Pdf URL: https://arxiv.org/pdf/2412.04460
Copy Paste: [[2412.04460]] LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors(https://arxiv.org/abs/2412.04460)
Keywords: diffusion, generative
Abstract: Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions, gaining popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an under-explored area. Layered content generation is crucial for creative workflows in fields like graphic design, animation, and digital art, where layer-based approaches are fundamental for flexible editing and composition. In this paper, we propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.
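
Illustrative note: the abstract describes a foreground layer (RGBA) and a background layer (RGB) but does not say how they are combined for display. The sketch below shows the standard "over" compositing operator for a single pixel, assuming straight (non-premultiplied) alpha and channel values in [0, 1]. The function name `composite_over` is hypothetical and is not taken from the paper or its code.

```python
def composite_over(fg_rgba, bg_rgb):
    """Blend one foreground RGBA pixel over one background RGB pixel.

    Assumes straight (non-premultiplied) alpha and values in [0, 1].
    This is the generic "over" operator, not the LayerFusion method.
    """
    r, g, b, alpha = fg_rgba
    return tuple(
        alpha * fg_channel + (1.0 - alpha) * bg_channel
        for fg_channel, bg_channel in zip((r, g, b), bg_rgb)
    )


# A half-transparent red foreground pixel over a blue background pixel.
blended = composite_over((1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0))
print(blended)
```

With alpha at 0 the background shows through unchanged, and with alpha at 1 the foreground fully covers it.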

Title: 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

Authors: Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04462
Pdf URL: https://arxiv.org/pdf/2412.04462
Copy Paste: [[2412.04462]] 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion(https://arxiv.org/abs/2412.04462)
Keywords: diffusion
Abstract: We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. We propose a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).
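
Illustrative note: the frame grid described in the abstract can be sketched as a small indexing example. This covers the layout only (rows share a timestep, columns share a viewpoint) and is not the authors' architecture. The grid sizes and the helper names `row` and `column` are assumptions made for illustration.

```python
# Hypothetical sizes, chosen only for illustration.
NUM_TIMESTEPS = 3
NUM_VIEWPOINTS = 4

# grid[t][v] labels the frame at timestep t seen from viewpoint v.
grid = [
    [(t, v) for v in range(NUM_VIEWPOINTS)]
    for t in range(NUM_TIMESTEPS)
]


def row(grid, t):
    """Frames that share timestep t, one per viewpoint."""
    return list(grid[t])


def column(grid, v):
    """Frames from viewpoint v, one per timestep."""
    return [frames_at_t[v] for frames_at_t in grid]


print(row(grid, 1))     # all viewpoints at timestep 1
print(column(grid, 2))  # all timesteps for viewpoint 2
```

Reading along a row varies the viewpoint at a fixed time, and reading down a column varies the time at a fixed viewpoint.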

Title: Turbo3D: Ultra-fast Text-to-3D Generation

Authors: Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04470
Pdf URL: https://arxiv.org/pdf/2412.04470
Copy Paste: [[2412.04470]] Turbo3D: Ultra-fast Text-to-3D Generation(https://arxiv.org/abs/2412.04470)
Keywords: diffusion
Abstract: We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.

Title: PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Authors: Vinayak Gupta, Yunze Man, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04471
Pdf URL: https://arxiv.org/pdf/2412.04471
Copy Paste: [[2412.04471]] PaintScene4D: Consistent 4D Scene Generation from Text Prompts(https://arxiv.org/abs/2412.04471)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at this https URL

Title: Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

Authors: Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04472
Pdf URL: https://arxiv.org/pdf/2412.04472
Copy Paste: [[2412.04472]] Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail(https://arxiv.org/abs/2412.04472)
Keywords: foundation model
Abstract: We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.