2024-11-27

Title: Leveraging Conversational Generative AI for Anomaly Detection in Digital Substations

Authors: Aydin Zaboli, Seong Lok Choi, Junho Hong
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2411.16692
Pdf URL: https://arxiv.org/pdf/2411.16692
Copy Paste: [[2411.16692]] Leveraging Conversational Generative AI for Anomaly Detection in Digital Substations(https://arxiv.org/abs/2411.16692)
Keywords: generative, anomaly
Abstract: This study addresses critical challenges of cybersecurity in digital substations by proposing an innovative task-oriented dialogue (ToD) system for anomaly detection (AD) in multicast messages, specifically, generic object oriented substation event (GOOSE) and sampled value (SV) datasets. Leveraging generative artificial intelligence (GenAI) technology, the proposed framework demonstrates superior error reduction, scalability, and adaptability compared with traditional human-in-the-loop (HITL) processes. Notably, this methodology offers significant advantages over machine learning (ML) techniques in terms of efficiency and implementation speed when confronting novel and/or unknown cyber threats, while also maintaining model complexity and precision. The research employs advanced performance metrics to conduct a comparative assessment between the proposed AD and HITL-based AD frameworks, utilizing a hardware-in-the-loop (HIL) testbed for generating and extracting features of IEC61850 communication messages. This approach presents a promising solution for enhancing the reliability of power system operations in the face of evolving cybersecurity challenges.

Title: Conditional Text-to-Image Generation with Reference Guidance

Authors: Taewook Kim, Ze Wang, Zhengyuan Yang, Jiang Wang, Lijuan Wang, Zicheng Liu, Qiang Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16713
Pdf URL: https://arxiv.org/pdf/2411.16713
Copy Paste: [[2411.16713]] Conditional Text-to-Image Generation with Reference Guidance(https://arxiv.org/abs/2411.16713)
Keywords: diffusion
Abstract: Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multi-lingual scene-text generation, and logo-image generation. Our expert plugins demonstrate superior results than the existing methods on all tasks, each containing only 28.55M trainable parameters.

Title: TPIE: Topology-Preserved Image Editing With Text Instructions

Authors: Nivetha Jayakumar, Srivardhan Reddy Gadila, Tonmoy Hossain, Yangfeng Ji, Miaomiao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16714
Pdf URL: https://arxiv.org/pdf/2411.16714
Copy Paste: [[2411.16714]] TPIE: Topology-Preserved Image Editing With Text Instructions(https://arxiv.org/abs/2411.16714)
Keywords: diffusion, generative
Abstract: Preserving topological structures is important in real-world applications, particularly in sensitive domains such as healthcare and medicine, where the correctness of human anatomy is critical. However, most existing image editing models focus on manipulating intensity and texture features, often overlooking object geometry within images. To address this issue, this paper introduces a novel method, Topology-Preserved Image Editing with text instructions (TPIE), that for the first time ensures the topology and geometry remaining intact in edited images through text-guided generative diffusion models. More specifically, our method treats newly generated samples as deformable variations of a given input template, allowing for controllable and structure-preserving edits. Our proposed TPIE framework consists of two key modules: (i) an autoencoder-based registration network that learns latent representations of object transformations, parameterized by velocity fields, from pairwise training images; and (ii) a novel latent conditional geometric diffusion (LCDG) model efficiently capturing the data distribution of learned transformation features conditioned on custom-defined text instructions. We validate TPIE on a diverse set of 2D and 3D images and compare them with state-of-the-art image editing approaches. Experimental results show that our method outperforms other baselines in generating more realistic images with well-preserved topology. Our code will be made publicly available on Github.

Title: PaRCE: Probabilistic and Reconstruction-based Competency Estimation for CNN-based Image Classification

Authors: Sara Pohland, Claire Tomlin
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16715
Pdf URL: https://arxiv.org/pdf/2411.16715
Copy Paste: [[2411.16715]] PaRCE: Probabilistic and Reconstruction-based Competency Estimation for CNN-based Image Classification(https://arxiv.org/abs/2411.16715)
Keywords: anomaly
Abstract: Convolutional neural networks (CNNs) are extremely popular and effective for image classification tasks but tend to be overly confident in their predictions. Various works have sought to quantify uncertainty associated with these models, detect out-of-distribution (OOD) inputs, or identify anomalous regions in an image, but limited work has sought to develop a holistic approach that can accurately estimate perception model confidence across various sources of uncertainty. We develop a probabilistic and reconstruction-based competency estimation (PaRCE) method and compare it to existing approaches for uncertainty quantification and OOD detection. We find that our method can best distinguish between correctly classified, misclassified, and OOD samples with anomalous regions, as well as between samples with visual image modifications resulting in high, medium, and low prediction accuracy. We describe how to extend our approach for anomaly localization tasks and demonstrate the ability of our approach to distinguish between regions in an image that are familiar to the perception model from those that are unfamiliar. We find that our method generates interpretable scores that most reliably capture a holistic notion of perception model confidence.

Title: Importance-based Token Merging for Diffusion Models

Authors: Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16720
Pdf URL: https://arxiv.org/pdf/2411.16720
Copy Paste: [[2411.16720]] Importance-based Token Merging for Diffusion Models(https://arxiv.org/abs/2411.16720)
Keywords: diffusion
Abstract: Diffusion models excel at high-quality image and video generation. However, a major drawback is their high latency. A simple yet powerful way to speed them up is by merging similar tokens for faster computation, though this can result in some quality loss. In this paper, we demonstrate that preserving important tokens during merging significantly improves sample quality. Notably, the importance of each token can be reliably determined using the classifier-free guidance magnitude, as this measure is strongly correlated with the conditioning input and corresponds to output fidelity. Since classifier-free guidance incurs no additional computational cost or requires extra modules, our method can be easily integrated into most diffusion-based frameworks. Experiments show that our approach significantly outperforms the baseline across various applications, including text-to-image synthesis, multi-view image generation, and video generation.

Title: $\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models

Authors: Dahye Kim, Xavier Thomas, Deepti Ghadiyaram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16725
Pdf URL: https://arxiv.org/pdf/2411.16725
Copy Paste: [[2411.16725]] $\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models(https://arxiv.org/abs/2411.16725)
Keywords: diffusion
Abstract: We study $\textit{how}$ rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using light-weight classifiers on off-the-shelf diffusion models' features. On $4$ datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impacts visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening interpretability of black-box diffusion models. Code and visualizations available at: this https URL

Title: EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Authors: Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, Qingfeng Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16726
Pdf URL: https://arxiv.org/pdf/2411.16726
Copy Paste: [[2411.16726]] EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion(https://arxiv.org/abs/2411.16726)
Keywords: diffusion
Abstract: Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.

Title: Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16738
Pdf URL: https://arxiv.org/pdf/2411.16738
Copy Paste: [[2411.16738]] Classifier-Free Guidance inside the Attraction Basin May Cause Memorization(https://arxiv.org/abs/2411.16738)
Keywords: diffusion
Abstract: Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs from which classifier-free guidance is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, \emph{opposite guidance}, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.

Title: FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction

Authors: Junwei You, Rui Gan, Weizhe Tang, Zilin Huang, Jiaxi Liu, Zhuoyu Jiang, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu, Sikai Chen, Bin Ran
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2411.16747
Pdf URL: https://arxiv.org/pdf/2411.16747
Copy Paste: [[2411.16747]] FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction(https://arxiv.org/abs/2411.16747)
Keywords: diffusion, generative
Abstract: Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS). Although deep learning-based approaches - especially those utilizing transformer-based and generative models - have markedly improved prediction accuracy by capturing complex, non-linear patterns in vehicle dynamics and traffic interactions, they frequently overlook detailed car-following behaviors and the inter-vehicle interactions critical for real-world driving applications, particularly in fully autonomous or mixed traffic scenarios. To address the issue, this study introduces a scaled noise conditional diffusion model for car-following trajectory prediction, which integrates detailed inter-vehicular interactions and car-following dynamics into a generative framework, improving both the accuracy and plausibility of predicted trajectories. The model utilizes a novel pipeline to capture historical vehicle dynamics by scaling noise with encoded historical features within the diffusion process. Particularly, it employs a cross-attention-based transformer architecture to model intricate inter-vehicle dependencies, effectively guiding the denoising process and enhancing prediction accuracy. Experimental results on diverse real-world driving scenarios demonstrate the state-of-the-art performance and robustness of the proposed method.

Title: LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Authors: Haojie Zhang, Zhihao Liang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li, Jianhua Tao, Yaling Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16748
Pdf URL: https://arxiv.org/pdf/2411.16748
Copy Paste: [[2411.16748]] LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis(https://arxiv.org/abs/2411.16748)
Keywords: diffusion
Abstract: Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.

Title: AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

Authors: You Li, Fan Ma, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16749
Pdf URL: https://arxiv.org/pdf/2411.16749
Copy Paste: [[2411.16749]] AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks(https://arxiv.org/abs/2411.16749)
Keywords: diffusion
Abstract: Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, the synthetic framework is usually designed with meticulous human effort for each task due to various requirements on image layout, content, and annotation formats, restricting the application of synthetic data on more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality synthetic images that are controllable and based on the generated layouts. In addition, user specific reference images, and style images can be incorporated into the generation to task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.

Title: PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation

Authors: Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2411.16750
Pdf URL: https://arxiv.org/pdf/2411.16750
Copy Paste: [[2411.16750]] PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation(https://arxiv.org/abs/2411.16750)
Keywords: diffusion
Abstract: This paper explores the potential of leveraging language priors learned by text-to-image diffusion models to address ambiguity and visual nuisance in monocular depth estimation. Particularly, traditional monocular depth estimation suffers from inherent ambiguity due to the absence of stereo or multi-view depth cues, and nuisance due to lack of robustness of vision. We argue that language prior in diffusion models can enhance monocular depth estimation by leveraging the geometric prior aligned with the language description, which is learned during text-to-image pre-training. To generate images that reflect the text properly, the model must comprehend the size and shape of specified objects, their spatial relationship, and the scale of the scene. Thus, we propose PriorDiffusion, using a pre-trained text-to-image diffusion model that takes both image and text description that aligned with the scene to infer affine-invariant depth through a denoising process. We also show that language priors can guide the model's attention to specific regions and help it perceive the 3D scene in alignment with user intent. Simultaneously, it acts as a constraint to accelerate the convergence of the diffusion trajectory, since learning 3D properties from a condensed, low-dimensional language feature is more efficient compared with learning from a redundant, high-dimensional image feature. By training on HyperSim and Virtual KITTI, we achieve state-of-the-art zero-shot performance and a faster convergence speed, compared with other diffusion-based depth estimators, across NYUv2, KITTI, ETH3D, and ScanNet.

Title: Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)

Authors: Nasrin Imanpour, Shashwat Bajpai, Subhankar Ghosh, Sainath Reddy Sankepally, Abhilekh Borah, Hasnat Md Abdullah, Nishoak Kosaraju, Shreyas Dixit, Ashhar Aziz, Shwetangshu Biswas, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16754
Pdf URL: https://arxiv.org/pdf/2411.16754
Copy Paste: [[2411.16754]] Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)(https://arxiv.org/abs/2411.16754)
Keywords: diffusion, generative
Abstract: The proliferation of AI techniques for image generation, coupled with their increasing accessibility, has raised significant concerns about the potential misuse of these images to spread misinformation. Recent AI-generated image detection (AGID) methods include CNNDetection, NPR, DM Image Detection, Fake Image Detection, DIRE, LASTED, GAN Image Detection, AIDE, SSP, DRCT, RINE, OCC-CLIP, De-Fake, and Deep Fake Detection. However, we argue that the current state-of-the-art AGID techniques are inadequate for effectively detecting contemporary AI-generated images and advocate for a comprehensive reevaluation of these methods. We introduce the Visual Counter Turing Test (VCT^2), a benchmark comprising ~130K images generated by contemporary text-to-image models (Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and Midjourney 6). VCT^2 includes two sets of prompts sourced from tweets by the New York Times Twitter account and captions from the MS COCO dataset. We also evaluate the performance of the aforementioned AGID techniques on the VCT$^2$ benchmark, highlighting their ineffectiveness in detecting AI-generated images. As image-generative AI models continue to evolve, the need for a quantifiable framework to evaluate these models becomes increasingly critical. To meet this need, we propose the Visual AI Index (V_AI), which assesses generated images from various visual perspectives, including texture complexity and object coherence, setting a new standard for evaluating image-generative AI models. To foster research in this domain, we make our this https URL and this https URL datasets publicly available.

Title: SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

Authors: Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, Alexander H. Liu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16765
Pdf URL: https://arxiv.org/pdf/2411.16765
Copy Paste: [[2411.16765]] SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction(https://arxiv.org/abs/2411.16765)
Keywords: self-supervised
Abstract: Sign language processing has traditionally relied on task-specific models,limiting the potential for transfer learning across tasks. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised transformer encoder that learns strong representations from approximately 1,000 hours of American Sign Language (ASL) video content. Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction for multi-stream visual sign language input, learning to predict multiple targets for corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple benchmarks. On sign language translation, it outperforms prior methods trained on publicly available data on the How2Sign (+0.7 BLEU), OpenASL (+10.0 BLEU), and FLEURS-ASL (+0.3 BLEU) benchmarks. Similarly for isolated sign language recognition, SHuBERT's accuracy surpasses that of specialized models on ASL-Citizen (+5\%) and SEM-LEX (+20.6\%), while coming close to them on WLASL2000 (-3\%). Ablation studies confirm the contribution of each component of the approach.

Title: Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background

Authors: Youngjae Cho, Gwangyeol Kim, Sirojbek Safarov, Seongdeok Bang, Jaewoo Park
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16767
Pdf URL: https://arxiv.org/pdf/2411.16767
Copy Paste: [[2411.16767]] Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background(https://arxiv.org/abs/2411.16767)
Keywords: generative, anomaly
Abstract: In anomaly detection, the scarcity of anomalous data compared to normal data poses a challenge in effectively utilizing deep neural network representations to identify anomalous features. From a data-centric perspective, generative models can solve this data imbalance issue by synthesizing anomaly datasets. Although previous research tried to enhance the controllability and quality of generating defects, they do not consider the relation between background and defect. Since the defect depends on the object's background (i.e., the normal part of an object), training only the defect area cannot utilize the background information, and even generation can be biased depending on the mask information. In addition, controlling logical anomalies should consider the dependency between background and defect areas (e.g., orange colored defect on a orange juice bottle). In this paper, our paper proposes modeling a relationship between the background and defect, where background affects denoising defects; however, the reverse is not. We introduce the regularizing term to disentangle denoising background from defects. From the disentanglement loss, we rethink defect generation with DDIM Inversion, where we generate the defect on the target normal image. Additionally, we theoretically prove that our methodology can generate a defect on the target normal image with an invariant background. We demonstrate our synthetic data is realistic and effective in several experiments.

Title: In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models

Authors: Zhi-Yi Chin, Kuan-Chen Mu, Mario Fritz, Pin-Yu Chen, Wei-Chen Chiu
Subjects: cs.LG, cs.CL, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16769
Pdf URL: https://arxiv.org/pdf/2411.16769
Copy Paste: [[2411.16769]] In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models(https://arxiv.org/abs/2411.16769)
Keywords: diffusion, in-context
Abstract: Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. While various safety mechanisms have been developed, the field lacks systematic tools for evaluating their effectiveness against real-world misuse scenarios. In this work, we propose ICER, a novel red-teaming framework that leverages Large Language Models (LLMs) and a bandit optimization-based algorithm to generate interpretable and semantic meaningful problematic prompts by learning from past successful red-teaming attempts. Our ICER efficiently probes safety mechanisms across different T2I models without requiring internal access or additional training, making it broadly applicable to deployed systems. Through extensive experiments, we demonstrate that ICER significantly outperforms existing prompt attack methods in identifying model vulnerabilities while maintaining high semantic similarity with intended content. By uncovering that successful jailbreaking instances can systematically facilitate the discovery of new vulnerabilities, our work provides crucial insights for developing more robust safety mechanisms in T2I systems.

Title: MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Authors: Feifei Shao, Ping Liu, Zhao Wang, Yawei Luo, Hongwei Wang, Jun Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16773
Pdf URL: https://arxiv.org/pdf/2411.16773
Copy Paste: [[2411.16773]] MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing(https://arxiv.org/abs/2411.16773)
Keywords: in-context
Abstract: Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable $4.1\%$ improvement in the part segmentation task and delivers consistent gains across various PCP applications.

Title: SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

Authors: Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, Sandeep Chinchali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16776
Pdf URL: https://arxiv.org/pdf/2411.16776
Copy Paste: [[2411.16776]] SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models(https://arxiv.org/abs/2411.16776)
Keywords: diffusion
Abstract: In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as "Clear and Day" weather, leading to decreased performance in under-represented conditions like "Rainy and Night". To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet-a DM that guides data generation conditioned on semantic maps-along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model.

Title: NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model

Authors: Jinpeng Liu, Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Ying Shan, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16779
Pdf URL: https://arxiv.org/pdf/2411.16779
Copy Paste: [[2411.16779]] NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model(https://arxiv.org/abs/2411.16779)
Keywords: diffusion, generative
Abstract: We introduce NovelGS, a diffusion model for Gaussian Splatting (GS) given sparse-view images. Recent works leverage feed-forward networks to generate pixel-aligned Gaussians, which could be fast rendered. Unfortunately, the method was unable to produce satisfactory results for areas not covered by the input images due to the formulation of these methods. In contrast, we leverage the novel view denoising through a transformer-based network to generate 3D Gaussians. Specifically, by incorporating both conditional views and noisy target views, the network predicts pixel-aligned Gaussians for each view. During training, the rendered target and some additional views of the Gaussians are supervised. During inference, the target views are iteratively rendered and denoised from pure noise. Our approach demonstrates state-of-the-art performance in addressing the multi-view image reconstruction challenge. Due to generative modeling of unseen regions, NovelGS effectively reconstructs 3D objects with consistent and sharp textures. Experimental results on publicly available datasets indicate that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively. We also demonstrate the potential of NovelGS in generative tasks, such as text-to-3D and image-to-3D, by integrating it with existing multiview diffusion models. We will make the code publicly accessible.

Title: Scaling Laws for Black box Adversarial Attacks

Authors: Chuan Liu, Huanran Chen, Yichi Zhang, Yinpeng Dong, Jun Zhu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16782
Pdf URL: https://arxiv.org/pdf/2411.16782
Copy Paste: [[2411.16782]] Scaling Laws for Black box Adversarial Attacks(https://arxiv.org/abs/2411.16782)
Keywords: foundation model
Abstract: A longstanding problem of deep learning models is their vulnerability to adversarial examples, which are often generated by applying imperceptible perturbations to natural examples. Adversarial examples exhibit cross-model transferability, enabling to attack black-box models with limited information about their architectures and parameters. Model ensembling is an effective strategy to improve the transferability by attacking multiple surrogate models simultaneously. However, as prior studies usually adopt few models in the ensemble, there remains an open question of whether scaling the number of models can further improve black-box attacks. Inspired by the findings in large foundation models, we investigate the scaling laws of black-box adversarial attacks in this work. By analyzing the relationship between the number of surrogate models and transferability of adversarial examples, we conclude with clear scaling laws, emphasizing the potential of using more surrogate models to enhance adversarial transferability. Extensive experiments verify the claims on standard image classifiers, multimodal large language models, and even proprietary models like GPT-4o, demonstrating consistent scaling effects and impressive attack success rates with more surrogate models. Further studies by visualization indicate that scaled attacks bring better interpretability in semantics, indicating that the common features of models are captured.

Title: CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis

Authors: Aravindan Sundaram, Ujjayan Pal, Abhimanyu Chauhan, Aishwarya Agarwal, Srikrishna Karanam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16783
Pdf URL: https://arxiv.org/pdf/2411.16783
Copy Paste: [[2411.16783]] CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis(https://arxiv.org/abs/2411.16783)
Keywords: diffusion
Abstract: Despite recent advancements in text-to-image models, achieving semantically accurate images in text-to-image diffusion models is a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated segment in the self-attention map despite despite having a high-response cross-attention, and (b) attention interference, where the generated image has mixed-up properties of multiple subjects because of a conflicting overlap between cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art.

Title: Contrastive Multi-graph Learning with Neighbor Hierarchical Sifting for Semi-supervised Text Classification

Authors: Wei Ai, Jianbin Li, Ze Wang, Yingying Wei, Tao Meng, Yuntao Shou, Keqin Lib
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16787
Pdf URL: https://arxiv.org/pdf/2411.16787
Copy Paste: [[2411.16787]] Contrastive Multi-graph Learning with Neighbor Hierarchical Sifting for Semi-supervised Text Classification(https://arxiv.org/abs/2411.16787)
Keywords: self-supervised
Abstract: Graph contrastive learning has been successfully applied in text classification due to its remarkable ability for self-supervised node representation learning. However, explicit graph augmentations may lead to a loss of semantics in the contrastive views. Secondly, existing methods tend to overlook edge features and the varying significance of node features during multi-graph learning. Moreover, the contrastive loss suffer from false negatives. To address these limitations, we propose a novel method of contrastive multi-graph learning with neighbor hierarchical sifting for semi-supervised text classification, namely ConNHS. Specifically, we exploit core features to form a multi-relational text graph, enhancing semantic connections among texts. By separating text graphs, we provide diverse views for contrastive learning. Our approach ensures optimal preservation of the graph information, minimizing data loss and distortion. Then, we separately execute relation-aware propagation and cross-graph attention propagation, which effectively leverages the varying correlations between nodes and edge features while harmonising the information fusion across graphs. Subsequently, we present the neighbor hierarchical sifting loss (NHS) to refine the negative selection. For one thing, following the homophily assumption, NHS masks first-order neighbors of the anchor and positives from being negatives. For another, NHS excludes the high-order neighbors analogous to the anchor based on their similarities. Consequently, it effectively reduces the occurrence of false negatives, preventing the expansion of the distance between similar samples in the embedding space. Our experiments on ThuCNews, SogouNews, 20 Newsgroups, and Ohsumed datasets achieved 95.86\%, 97.52\%, 87.43\%, and 70.65\%, which demonstrates competitive results in semi-supervised text classification.

Title: TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction

Authors: Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16788
Pdf URL: https://arxiv.org/pdf/2411.16788
Copy Paste: [[2411.16788]] TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction(https://arxiv.org/abs/2411.16788)
Keywords: diffusion
Abstract: We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Given no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large-language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures model focus on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only gives a robust model but also can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on four standard DG benchmark datasets and substantially outperform the current state-ofthe-art (12% improvement on average) while also demonstrating that our predictions can be visually interpreted

Title: From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task

Authors: Bohao Chen, Yanchao Zhang, Yanan Lv, Hua Han, Xi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16792
Pdf URL: https://arxiv.org/pdf/2411.16792
Copy Paste: [[2411.16792]] From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task(https://arxiv.org/abs/2411.16792)
Keywords: diffusion
Abstract: Diffusion models have recently emerged as a powerful technique in image generation, especially for image super-resolution tasks. While 2D diffusion models significantly enhance the resolution of individual images, existing diffusion-based methods for 3D volume super-resolution often struggle with structure discontinuities in axial direction and high sampling costs. In this work, we present a novel approach that leverages the 2D diffusion model and lateral continuity within the volume to enhance 3D volume electron microscopy (vEM) super-resolution. We first simulate lateral degradation with slices in the XY plane and train a 2D diffusion model to learn how to restore the degraded slices. The model is then applied slice-by-slice in the lateral direction of low-resolution volume, recovering slices while preserving inherent lateral continuity. Following this, a high-frequency-aware 3D super-resolution network is trained on the recovery lateral slice sequences to learn spatial feature transformation across slices. Finally, the network is applied to infer high-resolution volumes in the axial direction, enabling 3D super-resolution. We validate our approach through comprehensive evaluations, including image similarity assessments, resolution analysis, and performance on downstream tasks. Our results on two publicly available focused ion beam scanning electron microscopy (FIB-SEM) datasets demonstrate the robustness and practical applicability of our framework for 3D volume super-resolution.

Title: ST-Align: A Multimodal Foundation Model for Image-Gene Alignment in Spatial Transcriptomics

Authors: Yuxiang Lin, Ling Luo, Ying Chen, Xushi Zhang, Zihui Wang, Wenxian Yang, Mengsha Tong, Rongshan Yu
Subjects: cs.CV, q-bio.GN
Abstract URL: https://arxiv.org/abs/2411.16793
Pdf URL: https://arxiv.org/pdf/2411.16793
Copy Paste: [[2411.16793]] ST-Align: A Multimodal Foundation Model for Image-Gene Alignment in Spatial Transcriptomics(https://arxiv.org/abs/2411.16793)
Keywords: foundation model
Abstract: Spatial transcriptomics (ST) provides high-resolution pathological images and whole-transcriptomic expression profiles at individual spots across whole-slide scales. This setting makes it an ideal data source to develop multimodal foundation models. Although recent studies attempted to fine-tune visual encoders with trainable gene encoders based on spot-level, the absence of a wider slide perspective and spatial intrinsic relationships limits their ability to capture ST-specific insights effectively. Here, we introduce ST-Align, the first foundation model designed for ST that deeply aligns image-gene pairs by incorporating spatial context, effectively bridging pathological imaging with genomic features. We design a novel pretraining framework with a three-target alignment strategy for ST-Align, enabling (1) multi-scale alignment across image-gene pairs, capturing both spot- and niche-level contexts for a comprehensive perspective, and (2) cross-level alignment of multimodal insights, connecting localized cellular characteristics and broader tissue architecture. Additionally, ST-Align employs specialized encoders tailored to distinct ST contexts, followed by an Attention-Based Fusion Network (ABFN) for enhanced multimodal fusion, effectively merging domain-shared knowledge with ST-specific insights from both pathological and genomic data. We pre-trained ST-Align on 1.3 million spot-niche pairs and evaluated its performance through two downstream tasks across six datasets, demonstrating superior zero-shot and few-shot capabilities. ST-Align highlights the potential for reducing the cost of ST and providing valuable insights into the distinction of critical compositions within human tissue.

Title: Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery

Authors: Bhuvan Sachdeva, Naren Akash, Tajamul Ashraf, Simon Muller, Thomas Schultz, Maximilian W. M. Wintergerst, Niharika Singri Prasad, Kaushik Murali, Mohit Jain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16794
Pdf URL: https://arxiv.org/pdf/2411.16794
Copy Paste: [[2411.16794]] Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery(https://arxiv.org/abs/2411.16794)
Keywords: foundation model
Abstract: Cataract surgery is the most common surgical procedure globally, with a disproportionately higher burden in developing countries. While automated surgical video analysis has been explored in general surgery, its application to ophthalmic procedures remains limited. Existing works primarily focus on Phaco cataract surgery, an expensive technique not accessible in regions where cataract treatment is most needed. In contrast, Manual Small-Incision Cataract Surgery (MSICS) is the preferred low-cost, faster alternative in high-volume settings and for challenging cases. However, no dataset exists for MSICS. To address this gap, we introduce Cataract-MSICS, the first comprehensive dataset containing 53 surgical videos annotated for 18 surgical phases and 3,527 frames with 13 surgical tools at the pixel level. We benchmark this dataset on state-of-the-art models and present ToolSeg, a novel framework that enhances tool segmentation by introducing a phase-conditional decoder and a simple yet effective semi-supervised setup leveraging pseudo-labels from foundation models. Our approach significantly improves segmentation performance, achieving a $23.77\%$ to $38.10\%$ increase in mean Dice scores, with a notable boost for tools that are less prevalent and small. Furthermore, we demonstrate that ToolSeg generalizes to other surgical settings, showcasing its effectiveness on the CaDIS dataset.

Title: Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image

Authors: Jiajing Lin, Zhenzhong Wang, Shu Jiang, Yongjie Hou, Min Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16800
Pdf URL: https://arxiv.org/pdf/2411.16800
Copy Paste: [[2411.16800]] Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image(https://arxiv.org/abs/2411.16800)
Keywords: diffusion
Abstract: The task of 4D content generation involves creating dynamic 3D models that evolve over time in response to specific input conditions, such as images. Existing methods rely heavily on pre-trained video diffusion models to guide 4D content dynamics, but these approaches often fail to capture essential physical principles, as video diffusion models lack a robust understanding of real-world physics. Moreover, these models face challenges in providing fine-grained control over dynamics and exhibit high computational costs. In this work, we propose Phys4DGen, a novel, high-efficiency framework that generates physics-compliant 4D content from a single image with enhanced control capabilities. Our approach uniquely integrates physical simulations into the 4D generation pipeline, ensuring adherence to fundamental physical laws. Inspired by the human ability to infer physical properties visually, we introduce a Physical Perception Module (PPM) that discerns the material properties and structural components of the 3D object from the input image, facilitating accurate downstream simulations. Phys4DGen significantly accelerates the 4D generation process by eliminating iterative optimization steps in the dynamics modeling phase. It allows users to intuitively control the movement speed and direction of generated 4D content by adjusting external forces, achieving finely tunable, physically plausible animations. Extensive evaluations show that Phys4DGen outperforms existing methods in both inference speed and physical realism, producing high-quality, controllable 4D content.

Title: Controllable Human Image Generation with Personalized Multi-Garments

Authors: Yisol Choi, Sangkyung Kwak, Sihyun Yu, Hyungwon Choi, Jinwoo Shin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16801
Pdf URL: https://arxiv.org/pdf/2411.16801
Copy Paste: [[2411.16801]] Controllable Human Image Generation with Personalized Multi-Garments(https://arxiv.org/abs/2411.16801)
Keywords: diffusion
Abstract: We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in human image and extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide-applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.

Title: Abnormality-Driven Representation Learning for Radiology Imaging

Authors: Marta Ligero, Tim Lenz, Georg Wölflein, Omar S.M. El Nahhas, Daniel Truhn, Jakob Nikolas Kather
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16803
Pdf URL: https://arxiv.org/pdf/2411.16803
Copy Paste: [[2411.16803]] Abnormality-Driven Representation Learning for Radiology Imaging(https://arxiv.org/abs/2411.16803)
Keywords: self-supervised, foundation model
Abstract: To date, the most common approach for radiology deep learning pipelines is the use of end-to-end 3D networks based on models pre-trained on other tasks, followed by fine-tuning on the task at hand. In contrast, adjacent medical fields such as pathology, which focus on 2D images, have effectively adopted task-agnostic foundational models based on self-supervised learning (SSL), combined with weakly-supervised deep learning (DL). However, the field of radiology still lacks task-agnostic representation models due to the computational and data demands of 3D imaging and the anatomical complexity inherent to radiology scans. To address this gap, we propose CLEAR, a framework for radiology images that uses extracted embeddings from 2D slices along with attention-based aggregation for efficiently predicting clinical endpoints. As part of this framework, we introduce lesion-enhanced contrastive learning (LeCL), a novel approach to obtain visual representations driven by abnormalities in 2D axial slices across different locations of the CT scans. Specifically, we trained single-domain contrastive learning approaches using three different architectures: Vision Transformers, Vision State Space Models and Gated Convolutional Neural Networks. We evaluate our approach across three clinical tasks: tumor lesion location, lung disease detection, and patient staging, benchmarking against four state-of-the-art foundation models, including BiomedCLIP. Our findings demonstrate that CLEAR using representations learned through LeCL, outperforms existing foundation models, while being substantially more compute- and data-efficient.

Title: Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation

Authors: Shengeng Tang, Jiayi He, Lechao Cheng, Jingjing Wu, Dan Guo, Richang Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16810
Pdf URL: https://arxiv.org/pdf/2411.16810
Copy Paste: [[2411.16810]] Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation(https://arxiv.org/abs/2411.16810)
Keywords: diffusion
Abstract: Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.

Title: Pathways on the Image Manifold: Image Editing via Video Generation

Authors: Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16819
Pdf URL: https://arxiv.org/pdf/2411.16819
Copy Paste: [[2411.16819]] Pathways on the Image Manifold: Image Editing via Video Generation(https://arxiv.org/abs/2411.16819)
Keywords: diffusion
Abstract: Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.

Title: DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow

Authors: Ken Deng, Yuanchen Guo, Jingxiang Sun, Zixin Zou, Yangguang Li, Xin Cai, Yanpei Cao, Yebin Liu, Ding Liang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16820
Pdf URL: https://arxiv.org/pdf/2411.16820
Copy Paste: [[2411.16820]] DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow(https://arxiv.org/abs/2411.16820)
Keywords: generative
Abstract: Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.

Title: Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

Authors: Hanhui Wang, Yihua Zhang, Ruizheng Bai, Yue Zhao, Sijia Liu, Zhengzhong Tu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16832
Pdf URL: https://arxiv.org/pdf/2411.16832
Copy Paste: [[2411.16832]] Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing(https://arxiv.org/abs/2411.16832)
Keywords: diffusion, generative
Abstract: Recent advancements in diffusion models have made generative image editing more accessible, enabling creative edits but raising ethical concerns, particularly regarding malicious edits to human portraits that threaten privacy and identity security. Existing protection methods primarily rely on adversarial perturbations to nullify edits but often fail against diverse editing requests. We propose FaceLock, a novel approach to portrait protection that optimizes adversarial perturbations to destroy or significantly alter biometric information, rendering edited outputs biometrically unrecognizable. FaceLock integrates facial recognition and visual perception into perturbation optimization to provide robust protection against various editing attempts. We also highlight flaws in commonly used evaluation metrics and reveal how they can be manipulated, emphasizing the need for reliable assessments of protection. Experiments show FaceLock outperforms baselines in defending against malicious edits and is robust against purification techniques. Ablation studies confirm its stability and broad applicability across diffusion-based editing algorithms. Our work advances biometric defense and sets the foundation for privacy-preserving practices in image editing. The code is available at: this https URL.

Title: Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Authors: Andong Deng, Zhongpai Gao, Anwesa Choudhuri, Benjamin Planche, Meng Zheng, Bin Wang, Terrence Chen, Chen Chen, Ziyan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16932
Pdf URL: https://arxiv.org/pdf/2411.16932
Copy Paste: [[2411.16932]] Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding(https://arxiv.org/abs/2411.16932)
Keywords: self-supervised
Abstract: Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional information across image sequences, clip sequences, and long videos. Experiments demonstrate the effectiveness of our method, achieving a 27.6% improvement in F1 score and 44.8% in CIDEr on the YouCook2 benchmark and a 14.7% increase in recall on the Charades-STA benchmark compared to the baseline.

Title: MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning

Authors: Yuming Feng, Zhiyang Dou, Ling-Hao Chen, Yuan Liu, Tianyu Li, Jingbo Wang, Zeyu Cao, Wenping Wang, Taku Komura, Lingjie Liu
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2411.16964
Pdf URL: https://arxiv.org/pdf/2411.16964
Copy Paste: [[2411.16964]] MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning(https://arxiv.org/abs/2411.16964)
Keywords: diffusion
Abstract: Modeling temporal characteristics and the non-stationary dynamics of body movement plays a significant role in predicting human future motions. However, it is challenging to capture these features due to the subtle transitions involved in the complex human motions. This paper introduces MotionWavelet, a human motion prediction framework that utilizes Wavelet Transformation and studies human motion patterns in the spatial-frequency domain. In MotionWavelet, a Wavelet Diffusion Model (WDM) learns a Wavelet Manifold by applying Wavelet Transformation on the motion data therefore encoding the intricate spatial and temporal motion patterns. Once the Wavelet Manifold is built, WDM trains a diffusion model to generate human motions from Wavelet latent vectors. In addition to the WDM, MotionWavelet also presents a Wavelet Space Shaping Guidance mechanism to refine the denoising process to improve conformity with the manifold structure. WDM also develops Temporal Attention-Based Guidance to enhance prediction accuracy. Extensive experiments validate the effectiveness of MotionWavelet, demonstrating improved prediction accuracy and enhanced generalization across various benchmarks. Our code and models will be released upon acceptance.

Title: ZoomLDM: Latent Diffusion Model for multi-scale image generation

Authors: Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis, Prateek Prasanna, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16969
Pdf URL: https://arxiv.org/pdf/2411.16969
Copy Paste: [[2411.16969]] ZoomLDM: Latent Diffusion Model for multi-scale image generation(https://arxiv.org/abs/2411.16969)
Keywords: diffusion, self-supervised, generative
Abstract: Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to $4096 \times 4096$ pixels and $4\times$ super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments. We provide high-resolution examples of the generated images on our website this https URL.

Title: SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery

Authors: Caleb S. Spradlin, Jordan A. Caraballo-Vega, Jian Li, Mark L. Carroll, Jie Gong, Paul M. Montesano
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17000
Pdf URL: https://arxiv.org/pdf/2411.17000
Copy Paste: [[2411.17000]] SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery(https://arxiv.org/abs/2411.17000)
Keywords: self-supervised, foundation model
Abstract: Foundation models have the potential to transform the landscape of remote sensing (RS) data analysis by enabling large computer vision models to be pre-trained on vast amounts of remote sensing data. These models can then be fine-tuned with small amounts of labeled training and applied to a variety of applications. Most existing foundation models are designed for high spatial resolution, cloud-free satellite imagery or photos, limiting their applicability in scenarios that require frequent temporal monitoring or broad spectral profiles. As a result, foundation models trained solely on cloud-free images have limited utility for applications that involve atmospheric variables or require atmospheric corrections. We introduce SatVision-TOA, a novel foundation model pre-trained on 14-band MODIS L1B Top-Of-Atmosphere (TOA) radiance imagery, addressing the need for models pre-trained to handle moderate- and coarse-resolution all-sky remote sensing data. The SatVision-TOA model is pre-trained using a Masked-Image-Modeling (MIM) framework and the SwinV2 architecture, and learns detailed contextual representations through self-supervised learning without the need for labels. It is a 3 billion parameter model that is trained on 100 million images. To our knowledge this is the largest foundation model trained solely on satellite RS imagery. Results show that SatVision-TOA achieves superior performance over baseline methods on downstream tasks such as 3D cloud retrieval. Notably, the model achieves a mean intersection over union (mIOU) of 0.46, a substantial improvement over the baseline mIOU of 0.22. Additionally, the rate of false negative results in the fine-tuning task were reduced by over 50% compared to the baseline. Our work advances pre-trained vision modeling for multispectral RS by learning from a variety of atmospheric and aerosol conditions to improve cloud and land surface monitoring.

Title: Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation

Authors: Shambhavi Mishra, Julio Silva-Rodrıguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17002
Pdf URL: https://arxiv.org/pdf/2411.17002
Copy Paste: [[2411.17002]] Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation(https://arxiv.org/abs/2411.17002)
Keywords: foundation model
Abstract: Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is signifi- cantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribu- tion drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed cen- troids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representa- tion learning but without incurring additional computational complexity. Extensive experiments on multiple popular test- time adaptation benchmarks presenting diverse complex- ity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.

Title: TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

Authors: Zhenchen Wan, Yanwu Xu, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17017
Pdf URL: https://arxiv.org/pdf/2411.17017
Copy Paste: [[2411.17017]] TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On(https://arxiv.org/abs/2411.17017)
Keywords: diffusion, generative
Abstract: Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and preserving fine-grained details, such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism to generate prompts by optimizing Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for VTO task.

Title: D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow

Authors: Haiming Zhang, Xu Yan, Ying Xue, Zixuan Guo, Shuguang Cui, Zhen Li, Bingbing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17027
Pdf URL: https://arxiv.org/pdf/2411.17027
Copy Paste: [[2411.17027]] D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow(https://arxiv.org/abs/2411.17027)
Keywords: foundation model
Abstract: This technical report summarizes the second-place solution for the Predictive World Model Challenge held at the CVPR-2024 Workshop on Foundation Models for Autonomous Systems. We introduce D$^2$-World, a novel World model that effectively forecasts future point clouds through Decoupled Dynamic flow. Specifically, the past semantic occupancies are obtained via existing occupancy networks (e.g., BEVDet). Following this, the occupancy results serve as the input for a single-stage world model, generating future occupancy in a non-autoregressive manner. To further simplify the task, dynamic voxel decoupling is performed in the world model. The model generates future dynamic voxels by warping the existing observations through voxel flow, while remaining static voxels can be easily obtained through pose transformation. As a result, our approach achieves state-of-the-art performance on the OpenScene Predictive World Model benchmark, securing second place, and trains more than 300% faster than the baseline model. Code is available at this https URL.

Title: Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17041
Pdf URL: https://arxiv.org/pdf/2411.17041
Copy Paste: [[2411.17041]] Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models(https://arxiv.org/abs/2411.17041)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependency across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward model. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free$^2$Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.

Title: A generalised novel loss function for computational fluid dynamics

Authors: Zachary Cooper-Baldock, Paulo E. Santos, Russell S.A. Brinkworth, Karl Sammut
Subjects: cs.LG, cs.CV, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2411.17059
Pdf URL: https://arxiv.org/pdf/2411.17059
Copy Paste: [[2411.17059]] A generalised novel loss function for computational fluid dynamics(https://arxiv.org/abs/2411.17059)
Keywords: generative
Abstract: Computational fluid dynamics (CFD) simulations are crucial in automotive, aerospace, maritime and medical applications, but are limited by the complexity, cost and computational requirements of directly calculating the flow, often taking days of compute time. Machine-learning architectures, such as controlled generative adversarial networks (cGANs) hold significant potential in enhancing or replacing CFD investigations, due to cGANs ability to approximate the underlying data distribution of a dataset. Unlike traditional cGAN applications, where the entire image carries information, CFD data contains small regions of highly variant data, immersed in a large context of low variance that is of minimal importance. This renders most existing deep learning techniques that give equal importance to every portion of the data during training, inefficient. To mitigate this, a novel loss function is proposed called Gradient Mean Squared Error (GMSE) which automatically and dynamically identifies the regions of importance on a field-by-field basis, assigning appropriate weights according to the local variance. To assess the effectiveness of the proposed solution, three identical networks were trained; optimised with Mean Squared Error (MSE) loss, proposed GMSE loss and a dynamic variant of GMSE (DGMSE). The novel loss function resulted in faster loss convergence, correlating to reduced training time, whilst also displaying an 83.6% reduction in structural similarity error between the generated field and ground truth simulations, a 76.6% higher maximum rate of loss and an increased ability to fool a discriminator network. It is hoped that this loss function will enable accelerated machine learning within computational fluid dynamics.

Title: Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning

Authors: Xinyi Gao, Yayong Li, Tong Chen, Guanhua Ye, Wentao Zhang, Hongzhi Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17063
Pdf URL: https://arxiv.org/pdf/2411.17063
Copy Paste: [[2411.17063]] Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning(https://arxiv.org/abs/2411.17063)
Keywords: self-supervised
Abstract: With the increasing computation of training graph neural networks (GNNs) on large-scale graphs, graph condensation (GC) has emerged as a promising solution to synthesize a compact, substitute graph of the large-scale original graph for efficient GNN training. However, existing GC methods predominantly employ classification as the surrogate task for optimization, thus excessively relying on node labels and constraining their utility in label-sparsity scenarios. More critically, this surrogate task tends to overfit class-specific information within the condensed graph, consequently restricting the generalization capabilities of GC for other downstream tasks. To address these challenges, we introduce Contrastive Graph Condensation (CTGC), which adopts a self-supervised surrogate task to extract critical, causal information from the original graph and enhance the cross-task generalizability of the condensed graph. Specifically, CTGC employs a dual-branch framework to disentangle the generation of the node attributes and graph structures, where a dedicated structural branch is designed to explicitly encode geometric information through nodes' positional embeddings. By implementing an alternating optimization scheme with contrastive loss terms, CTGC promotes the mutual enhancement of both branches and facilitates high-quality graph generation through the model inversion technique. Extensive experiments demonstrate that CTGC excels in handling various downstream tasks with a limited number of labels, consistently outperforming state-of-the-art GC methods.

Title: Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models

Authors: Colin Conwell, Rupert Tawiah-Quashie, Tomer Ullman
Subjects: cs.CV, cs.CL, cs.SC
Abstract URL: https://arxiv.org/abs/2411.17066
Pdf URL: https://arxiv.org/pdf/2411.17066
Copy Paste: [[2411.17066]] Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models(https://arxiv.org/abs/2411.17066)
Keywords: diffusion, generative
Abstract: Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these `logical probes', and find that none reliably produce human agreement scores greater than 50\%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a `grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams, directly quantifying, for example, the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates for 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence (`approximate numeracy') of prompts involving integers. We conclude by discussing the limitations inherent to `grounded' multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g. DALL-E 3), or under-specified syntactical constraints (e.g. `grounded diffusion'), and propose minimal modifications (inspired by development, based in imagery) that could help to bridge the lingering compositional gap between scale and structure. All data and code is available at this https URL

Title: Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts

Authors: Jinho Chang, Hyungjin Chung, Jong Chul Ye
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17077
Pdf URL: https://arxiv.org/pdf/2411.17077
Copy Paste: [[2411.17077]] Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts(https://arxiv.org/abs/2411.17077)
Keywords: diffusion
Abstract: As Classifier-Free Guidance (CFG) has proven effective in conditional diffusion model sampling for improved condition alignment, many applications use a negated CFG term to filter out unwanted features from samples. However, simply negating CFG guidance creates an inverted probability distribution, often distorting samples away from the marginal distribution. Inspired by recent advances in conditional diffusion models for inverse problems, here we present a novel method to enhance negative CFG guidance using contrastive loss. Specifically, our guidance term aligns or repels the denoising direction based on the given condition through contrastive loss, achieving a nearly identical guiding direction to traditional CFG for positive guidance while overcoming the limitations of existing negative guidance methods. Experimental results demonstrate that our approach effectively removes undesirable concepts while maintaining sample quality across diverse scenarios, from simple class conditions to complex and overlapping text prompts.

Title: PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution

Authors: Libo Zhu, Jianze Li, Haotong Qin, Yulun Zhang, Yong Guo, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17106
Pdf URL: https://arxiv.org/pdf/2411.17106
Copy Paste: [[2411.17106]] PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution(https://arxiv.org/abs/2411.17106)
Keywords: diffusion
Abstract: Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even though the denoising step has been reduced to one, they require high computational costs and storage requirements, making it difficult for deployment on hardware devices. To address these issues, we propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR. First, we simplify OSD model to two core components, UNet and Variational Autoencoder (VAE) by removing the CLIPEncoder. Secondly, we propose Learnable Boundary Quantizer (LBQ) and Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR with 8-bit and 6-bit obtains comparable visual results with full-precision model. Moreover, our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be at this https URL.

Title: Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Authors: Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17150
Pdf URL: https://arxiv.org/pdf/2411.17150
Copy Paste: [[2411.17150]] Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2411.17150)
Keywords: foundation model
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet, a critical issue persists: lack of object-level context consideration when segmenting complex objects in the challenging environment of OVSS based on arbitrary query prompts. This oversight limits models' ability to group semantically consistent elements within object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model enhances intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.

Title: OSDFace: One-Step Diffusion Model for Face Restoration

Authors: Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17163
Pdf URL: https://arxiv.org/pdf/2411.17163
Copy Paste: [[2411.17163]] OSDFace: One-Step Diffusion Model for Face Restoration(https://arxiv.org/abs/2411.17163)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject's identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released at this https URL.

Title: ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Authors: Chengyou Jia, Changliang Xia, Zhuohang Dang, Weijia Wu, Hangwei Qian, Minnan Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17176
Pdf URL: https://arxiv.org/pdf/2411.17176
Copy Paste: [[2411.17176]] ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting(https://arxiv.org/abs/2411.17176)
Keywords: generative
Abstract: Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available in \url{this https URL}

Title: LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

Authors: Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17178
Pdf URL: https://arxiv.org/pdf/2411.17178
Copy Paste: [[2411.17178]] LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization(https://arxiv.org/abs/2411.17178)
Keywords: diffusion
Abstract: Visual Autoregressive (VAR) has emerged as a promising approach in image generation, offering competitive potential and performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we conducted analysis and identified significant redundancy in three dimensions of the VAR model: (1) the attention map, (2) the attention outputs when using classifier free guidance, and (3) the data precision. Correspondingly, we proposed efficient attention mechanism and low-bit quantization method to enhance the efficiency of VAR models while maintaining performance. With negligible performance lost (less than 0.056 FID increase), we could achieve 85.2% reduction in attention computation, 50% reduction in overall memory and 1.5x latency reduction. To ensure deployment feasibility, we developed efficient training-free compression techniques and analyze the deployment feasibility and efficiency gain of each technique.

Title: PhysMotion: Physics-Grounded Dynamics From a Single Image

Authors: Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, Chenfanfu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17189
Pdf URL: https://arxiv.org/pdf/2411.17189
Copy Paste: [[2411.17189]] PhysMotion: Physics-Grounded Dynamics From a Single Image(https://arxiv.org/abs/2411.17189)
Keywords: diffusion, generative
Abstract: We introduce PhysMotion, a novel framework that leverages principled physics-based simulations to guide intermediate 3D representations generated from a single image and input conditions (e.g., applied force and torque), producing high-quality, physically plausible video generation. By utilizing continuum mechanics-based simulations as a prior knowledge, our approach addresses the limitations of traditional data-driven generative models and result in more consistent physically plausible motions. Our framework begins by reconstructing a feed-forward 3D Gaussian from a single image through geometry optimization. This representation is then time-stepped using a differentiable Material Point Method (MPM) with continuum mechanics-based elastoplasticity models, which provides a strong foundation for realistic dynamics, albeit at a coarse level of detail. To enhance the geometry, appearance and ensure spatiotemporal consistency, we refine the initial simulation using a text-to-image (T2I) diffusion model with cross-frame attention, resulting in a physically plausible video that retains intricate details comparable to the input image. We conduct comprehensive qualitative and quantitative evaluations to validate the efficacy of our method. Our project page is available at: \url{this https URL}.

Title: SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting

Authors: Gyeongjin Kang, Jisang Yoo, Jihyeon Park, Seungtae Nam, Hyeonsoo Im, Shin sangheon, Sangpil Kim, Eunbyung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17190
Pdf URL: https://arxiv.org/pdf/2411.17190
Copy Paste: [[2411.17190]] SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting(https://arxiv.org/abs/2411.17190)
Keywords: self-supervised
Abstract: We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free and 3D prior-free generalizable 3D reconstruction from unposed multi-view images. These settings are inherently ill-posed due to the lack of ground-truth data, learned geometric information, and the need to achieve accurate 3D reconstruction without finetuning, making it difficult for conventional methods to achieve high-quality results. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation techniques, resulting in reciprocal improvements in both pose accuracy and 3D reconstruction quality. Furthermore, we incorporate a matching-aware pose estimation network and a depth refinement module to enhance geometry consistency across views, ensuring more accurate and stable 3D reconstructions. To present the performance of our method, we evaluated it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV. SelfSplat achieves superior results over previous state-of-the-art methods in both appearance and geometry quality, also demonstrates strong cross-dataset generalization capabilities. Extensive ablation studies and analysis also validate the effectiveness of our proposed methods. Code and pretrained models are available at this https URL

Title: Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning

Authors: Hui-Yue Yang, Hui Chen, Ao Wang, Kai Chen, Zijia Lin, Yongliang Tang, Pengcheng Gao, Yuming Quan, Jungong Han, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17217
Pdf URL: https://arxiv.org/pdf/2411.17217
Copy Paste: [[2411.17217]] Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning(https://arxiv.org/abs/2411.17217)
Keywords: anomaly
Abstract: Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perceptinon Tuning (SPT) method, aiming to enhance SAM's perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness. Models and codes will be available online.

Title: GraphSubDetector: Time Series Subsequence Anomaly Detection via Density-Aware Adaptive Graph Neural Network

Authors: Weiqi Chen, Zhiqiang Zhou, Qingsong Wen, Liang Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17218
Pdf URL: https://arxiv.org/pdf/2411.17218
Copy Paste: [[2411.17218]] GraphSubDetector: Time Series Subsequence Anomaly Detection via Density-Aware Adaptive Graph Neural Network(https://arxiv.org/abs/2411.17218)
Keywords: anomaly
Abstract: Time series subsequence anomaly detection is an important task in a large variety of real-world applications ranging from health monitoring to AIOps, and is challenging due to the following reasons: 1) how to effectively learn complex dynamics and dependencies in time series; 2) diverse and complicated anomalous subsequences as well as the inherent variance and noise of normal patterns; 3) how to determine the proper subsequence length for effective detection, which is a required parameter for many existing algorithms. In this paper, we present a novel approach to subsequence anomaly detection, namely GraphSubDetector. First, it adaptively learns the appropriate subsequence length with a length selection mechanism that highlights the characteristics of both normal and anomalous patterns. Second, we propose a density-aware adaptive graph neural network (DAGNN), which can generate further robust representations against variance of normal data for anomaly detection by message passing between subsequences. The experimental results demonstrate the effectiveness of the proposed algorithm, which achieves superior performance on multiple time series anomaly benchmark datasets compared to state-of-the-art algorithms.

Title: DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting

Authors: Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17223
Pdf URL: https://arxiv.org/pdf/2411.17223
Copy Paste: [[2411.17223]] DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting(https://arxiv.org/abs/2411.17223)
Keywords: diffusion, generative
Abstract: Subject-driven image inpainting has emerged as a popular task in image editing alongside recent advancements in diffusion models. Previous methods primarily focus on identity preservation but struggle to maintain the editability of inserted objects. In response, this paper introduces DreamMix, a diffusion-based generative model adept at inserting target objects into given scenes at user-specified locations while concurrently enabling arbitrary text-driven modifications to their attributes. In particular, we leverage advanced foundational inpainting models and introduce a disentangled local-global inpainting framework to balance precise local object insertion with effective global visual coherence. Additionally, we propose an Attribute Decoupling Mechanism (ADM) and a Textual Attribute Substitution (TAS) module to improve the diversity and discriminative capability of the text-based attribute guidance, respectively. Extensive experiments demonstrate that DreamMix effectively balances identity preservation and attribute editability across various application scenarios, including object insertion, attribute editing, and small object inpainting. Our code is publicly available at this https URL.

Title: From Graph Diffusion to Graph Classification

Authors: Jia Jun Cheng Xian, Sadegh Mahdavi, Renjie Liao, Oliver Schulte
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17236
Pdf URL: https://arxiv.org/pdf/2411.17236
Copy Paste: [[2411.17236]] From Graph Diffusion to Graph Classification(https://arxiv.org/abs/2411.17236)
Keywords: diffusion, generative
Abstract: Generative models such as diffusion models have achieved remarkable success in state-of-the-art image and text tasks. Recently, score-based diffusion models have extended their success beyond image generation, showing competitive performance with discriminative methods in image {\em classification} tasks~\cite{zimmermann2021score}. However, their application to classification in the {\em graph} domain, which presents unique challenges such as complex topologies, remains underexplored. We show how graph diffusion models can be applied for graph classification. We find that to achieve competitive classification accuracy, score-based graph diffusion models should be trained with a novel training objective that is tailored to graph classification. In experiments with a sampling-based inference method, our discriminative training objective achieves state-of-the-art graph classification accuracy.

Title: Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Authors: Junyuan Deng, Wei Yin, Xiaoyang Guo, Qian Zhang, Xiaotao Hu, Weiqiang Ren, Xiaoxiao Long, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17240
Pdf URL: https://arxiv.org/pdf/2411.17240
Copy Paste: [[2411.17240]] Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration(https://arxiv.org/abs/2411.17240)
Keywords: diffusion
Abstract: In this paper, we present DM-Calib, a diffusion-based approach for estimating pin- hole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, re- sulting in poor generalization across diverse real-world images. Recent advance- ments in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence in- dicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Im- age, which losslessly encodes the numerical camera intrinsics and integrates seam- lessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Im- age conditioned on an input image. By fine-tuning a stable diffusion model to gen- erate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach signifi- cantly outperforms baselines and provides broad benefits to 3D vision tasks. Code is available at this https URL.

Title: DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model

Authors: JiHwan Moon, Jihoon Park, Jungeun Kim, Jongseong Bae, Hyeongwoo Jeon, Ha Young Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17248
Pdf URL: https://arxiv.org/pdf/2411.17248
Copy Paste: [[2411.17248]] DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model(https://arxiv.org/abs/2411.17248)
Keywords: diffusion
Abstract: Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverages a diffusion model, enabling diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of input video. To enhance visual conditioning, we design Guidance Fusion Module, which fully utilizes the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over previous gloss-free SLT methods and achieve state-of-the-art performance on two SLT datasets, thereby markedly improving translation quality.

Title: APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents

Authors: Jun Yu Chen, Tao Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17255
Pdf URL: https://arxiv.org/pdf/2411.17255
Copy Paste: [[2411.17255]] APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents(https://arxiv.org/abs/2411.17255)
Keywords: diffusion
Abstract: We present APT, an advanced Large Language Model (LLM)-driven framework that enables autonomous agents to construct complex and creative structures within the Minecraft environment. Unlike previous approaches that primarily concentrate on skill-based open-world tasks or rely on image-based diffusion models for generating voxel-based structures, our method leverages the intrinsic spatial reasoning capabilities of LLMs. By employing chain-of-thought decomposition along with multimodal inputs, the framework generates detailed architectural layouts and blueprints that the agent can execute under zero-shot or few-shot learning scenarios. Our agent incorporates both memory and reflection modules to facilitate lifelong learning, adaptive refinement, and error correction throughout the building process. To rigorously evaluate the agent's performance in this emerging research area, we introduce a comprehensive benchmark consisting of diverse construction tasks designed to test creativity, spatial reasoning, adherence to in-game rules, and the effective integration of multimodal instructions. Experimental results using various GPT-based LLM backends and agent configurations demonstrate the agent's capacity to accurately interpret extensive instructions involving numerous items, their positions, and orientations. The agent successfully produces complex structures complete with internal functionalities such as Redstone-powered systems. A/B testing indicates that the inclusion of a memory module leads to a significant increase in performance, emphasizing its role in enabling continuous learning and the reuse of accumulated experience. Additionally, the agent's unexpected emergence of scaffolding behavior highlights the potential of future LLM-driven agents to utilize subroutine planning and leverage the emergence ability of LLMs to autonomously develop human-like problem-solving techniques.

Title: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling

Authors: Alexander Capstick, Rahul G. Krishnan, Payam Barnaghi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17284
Pdf URL: https://arxiv.org/pdf/2411.17284
Copy Paste: [[2411.17284]] Using Large Language Models for Expert Prior Elicitation in Predictive Modelling(https://arxiv.org/abs/2411.17284)
Keywords: in-context
Abstract: Large language models (LLMs), trained on diverse data effectively acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency hinder their direct application for specialised tasks. In fields such as clinical research, acquiring expert annotations or prior knowledge about predictive models is often costly and time-consuming. This study proposes using LLMs to elicit expert prior distributions for predictive models. This approach also provides an alternative to in-context learning, where language models are tasked with making predictions directly. We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation. Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings. Applied to clinical problems, this translates to fewer required biological samples, lowering cost and resources. Prior elicitation also consistently outperforms and proves more reliable than in-context learning at a lower cost, making it a preferred alternative in our setting. We demonstrate the utility of this method across various use cases, including clinical applications. For infection prediction, using LLM-elicited priors reduced the number of required labels to achieve the same accuracy as an uninformative prior by 55%, at 200 days earlier in the study.

Title: Reward Incremental Learning in Text-to-Image Generation

Authors: Maorong Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17310
Pdf URL: https://arxiv.org/pdf/2411.17310
Copy Paste: [[2411.17310]] Reward Incremental Learning in Text-to-Image Generation(https://arxiv.org/abs/2411.17310)
Keywords: diffusion
Abstract: The recent success of denoising diffusion models has significantly advanced text-to-image generation. While these large-scale pretrained models show excellent performance in general image synthesis, downstream objectives often require fine-tuning to meet specific criteria such as aesthetics or human preference. Reward gradient-based strategies are promising in this context, yet existing methods are limited to single-reward tasks, restricting their applicability in real-world scenarios that demand adapting to multiple objectives introduced incrementally over time. In this paper, we first define this more realistic and unexplored problem, termed Reward Incremental Learning (RIL), where models are desired to adapt to multiple downstream objectives incrementally. Additionally, while the models adapt to the ever-emerging new objectives, we observe a unique form of catastrophic forgetting in diffusion model fine-tuning, affecting both metric-wise and visual structure-wise image quality. To address this catastrophic forgetting challenge, we propose Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead, enabling stable performance across sequential reward tasks. The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios. The source code of our work will be publicly available upon acceptance.

Title: InsightEdit: Towards Better Instruction Following for Image Editing

Authors: Yingjing Xu, Jie Kong, Jiazhi Wang, Xiao Pan, Bo Lin, Qiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17323
Pdf URL: https://arxiv.org/pdf/2411.17323
Copy Paste: [[2411.17323]] InsightEdit: Towards Better Instruction Following for Image Editing(https://arxiv.org/abs/2411.17323)
Keywords: diffusion
Abstract: In this paper, we focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations still remain: First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on the text while the rich image information is underexplored, therefore inferior in complex instruction following and maintaining background consistency. Targeting these issues, we first curated the AdvancedEdit dataset using a novel data construction pipeline, formulating a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism utilizing both the textual and visual features reasoned by the powerful Multimodal Large Language Models (MLLM) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.

Title: The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations

Authors: Theodora Worledge, Tatsunori Hashimoto, Carlos Guestrin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17375
Pdf URL: https://arxiv.org/pdf/2411.17375
Copy Paste: [[2411.17375]] The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations(https://arxiv.org/abs/2411.17375)
Keywords: in-context
Abstract: Across all fields of academic study, experts cite their sources when sharing information. While large language models (LLMs) excel at synthesizing information, they do not provide reliable citation to sources, making it difficult to trace and verify the origins of the information they present. In contrast, search engines make sources readily accessible to users and place the burden of synthesizing information on the user. Through a survey, we find that users prefer search engines over LLMs for high-stakes queries, where concerns regarding information provenance outweigh the perceived utility of LLM responses. To examine the interplay between verifiability and utility of information-sharing tools, we introduce the extractive-abstractive spectrum, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points. Search engines are extractive because they respond to queries with snippets of sources with links (citations) to the original webpages. LLMs are abstractive because they address queries with answers that synthesize and logically transform relevant information from training and in-context sources without reliable citation. We define five operating points that span the extractive-abstractive spectrum and conduct human evaluations on seven systems across four diverse query distributions that reflect real-world QA settings: web search, language simplification, multi-step reasoning, and medical advice. As outputs become more abstractive, we find that perceived utility improves by as much as 200%, while the proportion of properly cited sentences decreases by as much as 50% and users take up to 3 times as long to verify cited information. Our findings recommend distinct operating points for domain-specific LLM systems and our failure analysis informs approaches to high-utility LLM systems that empower users to verify information.

Title: RealTraj: Towards Real-World Pedestrian Trajectory Forecasting

Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17376
Pdf URL: https://arxiv.org/pdf/2411.17376
Copy Paste: [[2411.17376]] RealTraj: Towards Real-World Pedestrian Trajectory Forecasting(https://arxiv.org/abs/2411.17376)
Keywords: self-supervised
Abstract: This paper jointly addresses three key limitations in conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach includes two training phases--self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data--to minimize data collection efforts. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that remains invariant in tracking noise by using past detections as inputs. Additionally, we pretrain the model using multiple pretext tasks, which enhance robustness and improve forecasting performance based solely on detection data. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, significantly reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art trajectory forecasting methods on multiple datasets.

Title: AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

Authors: Ziyi Xu, Ziyao Huang, Juan Cao, Yong Zhang, Xiaodong Cun, Qing Shuai, Yuchen Wang, Linchao Bao, Jintao Li, Fan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17383
Pdf URL: https://arxiv.org/pdf/2411.17383
Copy Paste: [[2411.17383]] AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation(https://arxiv.org/abs/2411.17383)
Keywords: diffusion
Abstract: The automatic generation of anchor-style product promotion videos presents promising opportunities in online commerce, advertising, and consumer engagement. However, this remains a challenging task despite significant advancements in pose-guided human video generation. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Additionally, we introduce the HOI-region reweighting loss, a training objective that enhances the learning of object details. Extensive experiments demonstrate that our proposed system outperforms existing methods in preserving object appearance and shape awareness, while simultaneously maintaining consistency in human appearance and motion. Project page: this https URL

Title: One Mind, Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models

Authors: Pengfei Cao, Yuheng Chen, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17401
Pdf URL: https://arxiv.org/pdf/2411.17401
Copy Paste: [[2411.17401]] One Mind, Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models(https://arxiv.org/abs/2411.17401)
Keywords: self-supervised
Abstract: Large language models (LLMs) have learned vast amounts of factual knowledge through self-supervised pre-training on large-scale corpora. Meanwhile, LLMs have also demonstrated excellent multilingual capabilities, which can express the learned knowledge in multiple languages. However, the knowledge storage mechanism in LLMs still remains mysterious. Some researchers attempt to demystify the factual knowledge in LLMs from the perspective of knowledge neurons, and subsequently discover language-agnostic knowledge neurons that store factual knowledge in a form that transcends language barriers. However, the preliminary finding suffers from two limitations: 1) High Uncertainty in Localization Results. Existing study only uses a prompt-based probe to localize knowledge neurons for each fact, while LLMs cannot provide consistent answers for semantically equivalent queries. Thus, it leads to inaccurate localization results with high uncertainty. 2) Lack of Analysis in More Languages. The study only analyzes language-agnostic knowledge neurons on English and Chinese data, without exploring more language families and languages. Naturally, it limits the generalizability of the findings. To address aforementioned problems, we first construct a new benchmark called Rephrased Multilingual LAMA (RML-LAMA), which contains high-quality cloze-style multilingual parallel queries for each fact. Then, we propose a novel method named Multilingual Integrated Gradients with Uncertainty Estimation (MATRICE), which quantifies the uncertainty across queries and languages during knowledge localization. Extensive experiments show that our method can accurately localize language-agnostic knowledge neurons. We also further investigate the role of language-agnostic knowledge neurons in cross-lingual knowledge editing, knowledge enhancement and new knowledge injection.

Title: CoA: Chain-of-Action for Generative Semantic Labels

Authors: Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17406
Pdf URL: https://arxiv.org/pdf/2411.17406
Copy Paste: [[2411.17406]] CoA: Chain-of-Action for Generative Semantic Labels(https://arxiv.org/abs/2411.17406)
Keywords: generative
Abstract: Recent advances in vision-language models (VLM) have demonstrated remarkable capability in image classification. These VLMs leverage a predefined set of categories to construct text prompts for zero-shot reasoning. However, in more open-ended domains like autonomous driving, using a predefined set of labels becomes impractical, as the semantic label space is unknown and constantly evolving. Additionally, fixed embedding text prompts often tend to predict a single label (while in reality, multiple labels commonly exist per image). In this paper, we introduce CoA, an innovative Chain-of-Action (CoA) method that generates labels aligned with all contextually relevant features of an image. CoA is designed based on the observation that enriched and valuable contextual information improves generative performance during inference. Traditional vision-language models tend to output singular and redundant responses. Therefore, we employ a tailored CoA to alleviate this problem. We first break down the generative labeling task into detailed actions and construct an CoA leading to the final generative objective. Each action extracts and merges key information from the previous action and passes the enriched information as context to the next action, ultimately improving the VLM in generating comprehensive and accurate semantic labels. We assess the effectiveness of CoA through comprehensive evaluations on widely-used benchmark datasets and the results demonstrate significant improvements across key performance metrics.

Title: DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters

Authors: Mingze Sun, Junhao Chen, Junting Dong, Yurun Chen, Xinyu Jiang, Shiwei Mao, Puhua Jiang, Jingbo Wang, Bo Dai, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17423
Pdf URL: https://arxiv.org/pdf/2411.17423
Copy Paste: [[2411.17423]] DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters(https://arxiv.org/abs/2411.17423)
Keywords: diffusion, generative
Abstract: Recent advances in generative models have enabled high-quality 3D character reconstruction from multi-modal. However, animating these generated characters remains a challenging task, especially for complex elements like garments and hair, due to the lack of large-scale datasets and effective rigging methods. To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. Unlike existing methods, DRiVE utilizes a 3D Gaussian representation, facilitating efficient animation and high-quality rendering. We further introduce GSDiff, a 3D Gaussian-based diffusion module that predicts joint positions as spatial distributions, overcoming the limitations of regression-based approaches. Extensive experiments demonstrate that DRiVE achieves precise rigging results, enabling realistic dynamics for clothing and hair, and surpassing previous methods in both quality and versatility. The code and dataset will be made public for academic use upon acceptance.

Title: Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps

Authors: Xue Xia, Randall Balestriero, Tao Zhang, Lorenz Hurni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17425
Pdf URL: https://arxiv.org/pdf/2411.17425
Copy Paste: [[2411.17425]] Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps(https://arxiv.org/abs/2411.17425)
Keywords: self-supervised
Abstract: Tracking geographic entities from historical maps, such as buildings, offers valuable insights into cultural heritage, urbanization patterns, environmental changes, and various historical research endeavors. However, linking these entities across diverse maps remains a persistent challenge for researchers. Traditionally, this has been addressed through a two-step process: detecting entities within individual maps and then associating them via a heuristic-based post-processing step. In this paper, we propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS). This method significantly streamlines geographic entity alignment and enhances automation. However, acquiring high-quality, video-format training data for VIS models is prohibitively expensive, especially for historical maps that often contain hundreds or thousands of geographic entities. To mitigate this challenge, we explore self-supervised learning (SSL) techniques to enhance VIS performance on historical maps. We evaluate the performance of VIS models under different pretraining configurations and introduce a novel method for generating synthetic videos from unlabeled historical map images for pretraining. Our proposed self-supervised VIS method substantially reduces the need for manual annotation. Experimental results demonstrate the superiority of the proposed self-supervised VIS approach, achieving a 24.9\% improvement in AP and a 0.23 increase in F1 score compared to the model trained from scratch.

Title: Rewiring Techniques to Mitigate Oversquashing and Oversmoothing in GNNs: A Survey

Authors: Hugo Attali, Davide Buscaldi, Nathalie Pernelle
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17429
Pdf URL: https://arxiv.org/pdf/2411.17429
Copy Paste: [[2411.17429]] Rewiring Techniques to Mitigate Oversquashing and Oversmoothing in GNNs: A Survey(https://arxiv.org/abs/2411.17429)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their effectiveness is often constrained by two critical challenges: oversquashing, where the excessive compression of information from distant nodes results in significant information loss, and oversmoothing, where repeated message-passing iterations homogenize node representations, obscuring meaningful distinctions. These issues, intrinsically linked to the underlying graph structure, hinder information flow and constrain the expressiveness of GNNs. In this survey, we examine graph rewiring techniques, a class of methods designed to address these structural bottlenecks by modifying graph topology to enhance information diffusion. We provide a comprehensive review of state-of-the-art rewiring approaches, delving into their theoretical underpinnings, practical implementations, and performance trade-offs.

Title: "Stupid robot, I want to speak to a human!" User Frustration Detection in Task-Oriented Dialog Systems

Authors: Mireia Hernandez Caralt, Ivan Sekulić, Filip Carević, Nghia Khau, Diana Nicoleta Popa, Bruna Guedes, Victor Guimarães, Zeyu Yang, Andre Manso, Meghana Reddy, Paolo Rosso, Roland Mathis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17437
Pdf URL: https://arxiv.org/pdf/2411.17437
Copy Paste: [[2411.17437]] "Stupid robot, I want to speak to a human!" User Frustration Detection in Task-Oriented Dialog Systems(https://arxiv.org/abs/2411.17437)
Keywords: in-context
Abstract: Detecting user frustration in modern-day task-oriented dialog (TOD) systems is imperative for maintaining overall user satisfaction, engagement, and retention. However, most recent research is focused on sentiment and emotion detection in academic settings, thus failing to fully encapsulate implications of real-world user data. To mitigate this gap, in this work, we focus on user frustration in a deployed TOD system, assessing the feasibility of out-of-the-box solutions for user frustration detection. Specifically, we compare the performance of our deployed keyword-based approach, open-source approaches to sentiment analysis, dialog breakdown detection methods, and emerging in-context learning LLM-based detection. Our analysis highlights the limitations of open-source methods for real-world frustration detection, while demonstrating the superior performance of the LLM-based approach, achieving a 16\% relative improvement in F1 score on an internal benchmark. Finally, we analyze advantages and limitations of our methods and provide an insight into user frustration detection task for industry practitioners.

Title: Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2411.17440
Pdf URL: https://arxiv.org/pdf/2411.17440
Copy Paste: [[2411.17440]] Identity-Preserving Text-to-Video Generation by Frequency Decomposition(https://arxiv.org/abs/2411.17440)
Keywords: diffusion, generative
Abstract: Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.

Title: VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Authors: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17451
Pdf URL: https://arxiv.org/pdf/2411.17451
Copy Paste: [[2411.17451]] VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models(https://arxiv.org/abs/2411.17451)
Keywords: generative
Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models, demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B, struggle to surpass random-guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.

Title: WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Authors: Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, Li Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17459
Pdf URL: https://arxiv.org/pdf/2411.17459
Copy Paste: [[2411.17459]] WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model(https://arxiv.org/abs/2411.17459)
Keywords: diffusion
Abstract: Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities of latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Wavelet transform can decompose videos into multiple frequency-domain components and improve the efficiency significantly, we thus propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality. Our code and models are available at this https URL.

Title: Learning 3D Representations from Procedural 3D Programs

Authors: Xuweiyi Chen, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17467
Pdf URL: https://arxiv.org/pdf/2411.17467
Copy Paste: [[2411.17467]] Learning 3D Representations from Procedural 3D Programs(https://arxiv.org/abs/2411.17467)
Keywords: self-supervised
Abstract: Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.

Title: Towards Precise Scaling Laws for Video Diffusion Transformers

Authors: Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17470
Pdf URL: https://arxiv.org/pdf/2411.17470
Copy Paste: [[2411.17470]] Towards Precise Scaling Laws for Video Diffusion Transformers(https://arxiv.org/abs/2411.17470)
Keywords: diffusion
Abstract: Achieving optimal performance of video diffusion transformers within given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealed under practical inference cost constraints, achieving a better trade-off.

Title: Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Authors: Eric Hanchen Jiang, Yasi Zhang, Zhi Zhang, Yixin Wan, Andrew Lizarraga, Shufan Li, Ying Nian Wu
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17472
Pdf URL: https://arxiv.org/pdf/2411.17472
Copy Paste: [[2411.17472]] Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory(https://arxiv.org/abs/2411.17472)
Keywords: diffusion, generative
Abstract: Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images from textual prompts. Despite these advances, existing models struggle with complex prompts involving multiple objects and attributes, often misaligning modifiers with their corresponding nouns or neglecting certain elements. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding and a lack of robust generalization guarantees. Leveraging the PAC-Bayes framework, we propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment. We demonstrate the effectiveness of our method on standard benchmarks, achieving state-of-the-art results across multiple metrics. By integrating custom priors into the denoising process, our method enhances image quality and addresses long-standing challenges in T2I diffusion models, paving the way for more reliable and interpretable generative models.

Title: Probing the Mid-level Vision Capabilities of Self-Supervised Learning

Authors: Xuweiyi Chen, Markus Marks, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17474
Pdf URL: https://arxiv.org/pdf/2411.17474
Copy Paste: [[2411.17474]] Probing the Mid-level Vision Capabilities of Self-Supervised Learning(https://arxiv.org/abs/2411.17474)
Keywords: self-supervised
Abstract: Mid-level vision capabilities - such as generic object localization and 3D geometric understanding - are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.

Title: SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

Authors: Yijia Hong, Yuan-Chen Guo, Ran Yi, Yulong Chen, Yan-Pei Cao, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17515
Pdf URL: https://arxiv.org/pdf/2411.17515
Copy Paste: [[2411.17515]] SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates(https://arxiv.org/abs/2411.17515)
Keywords: diffusion
Abstract: Decomposing physically-based materials from images into their constituent properties remains challenging, particularly when maintaining both computational efficiency and physical consistency. While recent diffusion-based approaches have shown promise, they face substantial computational overhead due to multiple denoising steps and separate models for different material properties. We present SuperMat, a single-step framework that achieves high-quality material decomposition with one-step inference. This enables end-to-end training with perceptual and re-render losses while decomposing albedo, metallic, and roughness maps at millisecond-scale speeds. We further extend our framework to 3D objects through a UV refinement network, enabling consistent material estimation across viewpoints while maintaining efficiency. Experiments demonstrate that SuperMat achieves state-of-the-art PBR material decomposition quality while reducing inference time from seconds to milliseconds per image, and completes PBR material estimation for 3D objects in approximately 3 seconds.

Title: FTMoMamba: Motion Generation with Frequency and Text State Space Models

Authors: Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17532
Pdf URL: https://arxiv.org/pdf/2411.17532
Copy Paste: [[2411.17532]] FTMoMamba: Motion Generation with Frequency and Text State Space Models(https://arxiv.org/abs/2411.17532)
Keywords: diffusion
Abstract: Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static pose (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, especially gaining the lowest FID of 0.181 (rather lower than 0.421 of MLD) on the HumanML3D dataset.

Title: IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation - An Enhanced Prototype-Guided Diffusion Framework

Authors: Anurag Shandilya, Swapnil Bhat, Akshat Gautam, Subhash Yadav, Siddharth Bhatt, Deval Mehta, Kshitij Jadhav
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17535
Pdf URL: https://arxiv.org/pdf/2411.17535
Copy Paste: [[2411.17535]] IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation - An Enhanced Prototype-Guided Diffusion Framework(https://arxiv.org/abs/2411.17535)
Keywords: diffusion, generative
Abstract: Generative models have proven to be very effective in generating synthetic medical images and find applications in downstream tasks such as enhancing rare disease datasets, long-tailed dataset augmentation, and scaling machine learning algorithms. For medical applications, the synthetically generated medical images by such models are still reasonable in quality when evaluated based on traditional metrics such as FID score, precision, and recall. However, these metrics fail to capture the medical/biological plausibility of the generated images. Human expert feedback has been used to get biological plausibility which demonstrates that these generated images have very low plausibility. Recently, the research community has further integrated this human feedback through Reinforcement Learning from Human Feedback(RLHF), which generates more medically plausible images. However, incorporating human feedback is a costly and slow process. In this work, we propose a novel approach to improve the medical plausibility of generated images without the need for human feedback. We introduce IMPROVE:Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework, a prototype-guided diffusion process for medical image generation and show that it substantially enhances the biological plausibility of the generated medical images without the need for any human feedback. We perform experiments on Bone Marrow and HAM10000 datasets and show that medical accuracy can be substantially increased without human feedback.

Title: AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments

Authors: Haitham S. Al-Sinani, Chris J. Mitchell
Subjects: cs.CR, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2411.17539
Pdf URL: https://arxiv.org/pdf/2411.17539
Copy Paste: [[2411.17539]] AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments(https://arxiv.org/abs/2411.17539)
Keywords: generative
Abstract: This study explores the application of generative AI (GenAI) within manual exploitation and privilege escalation tasks in Linux-based penetration testing environments, two areas critical to comprehensive cybersecurity assessments. Building on previous research into the role of GenAI in the ethical hacking lifecycle, this paper presents a hands-on experimental analysis conducted in a controlled virtual setup to evaluate the utility of GenAI in supporting these crucial, often manual, tasks. Our findings demonstrate that GenAI can streamline processes, such as identifying potential attack vectors and parsing complex outputs for sensitive data during privilege escalation. The study also identifies key benefits and challenges associated with GenAI, including enhanced efficiency and scalability, alongside ethical concerns related to data privacy, unintended discovery of vulnerabilities, and potential for misuse. This work contributes to the growing field of AI-assisted cybersecurity by emphasising the importance of human-AI collaboration, especially in contexts requiring careful decision-making, rather than the complete replacement of human input.

Title: Pre-training for Action Recognition with Automatically Generated Fractal Datasets

Authors: Davyd Svyezhentsev, George Retsinas, Petros Maragos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17584
Pdf URL: https://arxiv.org/pdf/2411.17584
Copy Paste: [[2411.17584]] Pre-training for Action Recognition with Automatically Generated Fractal Datasets(https://arxiv.org/abs/2411.17584)
Keywords: generative
Abstract: In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemmed by the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at this https URL .

Title: VideoDirector: Precise Video Editing via Text-to-Video Models

Authors: Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, Yulan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17592
Pdf URL: https://arxiv.org/pdf/2411.17592
Copy Paste: [[2411.17592]] VideoDirector: Precise Video Editing via Text-to-Video Models(https://arxiv.org/abs/2411.17592)
Keywords: diffusion, generative
Abstract: Despite the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tightly Spatial-temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-temporal Layout. The vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.

Title: Accelerating Vision Diffusion Transformers with Skip Branches

Authors: Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Cheng Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17616
Pdf URL: https://arxiv.org/pdf/2411.17616
Copy Paste: [[2411.17616]] Accelerating Vision Diffusion Transformers with Skip Branches(https://arxiv.org/abs/2411.17616)
Keywords: diffusion
Abstract: Diffusion Transformers (DiT), an emerging image and video generation model architecture, has demonstrated great potential because of its high generation quality and scalability properties. Despite the impressive performance, its practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. While feature caching across timesteps has proven effective in accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through empirical analysis of DiT feature dynamics, we identify that significant feature variation between DiT blocks presents a key challenge for feature reusability. To address this, we convert standard DiT into Skip-DiT with skip branches to enhance feature smoothness. Further, we introduce Skip-Cache which utilizes the skip branches to cache DiT features across timesteps at the inference time. We validated effectiveness of our proposal on different DiT backbones for video and image generation, showcasing skip branches to help preserve generation quality and achieve higher speedup. Experimental results indicate that Skip-DiT achieves a 1.5x speedup almost for free and a 2.2x speedup with only a minor reduction in quantitative metrics. Code is available at this https URL.

Title: A robust image encryption scheme based on new 4-D hyperchaotic system and elliptic curve

Authors: Yehia Lalili, Toufik Bouden, Morad Grimes, Abderrazek Lachouri
Subjects: cs.CR, eess.IV
Abstract URL: https://arxiv.org/abs/2411.17643
Pdf URL: https://arxiv.org/pdf/2411.17643
Copy Paste: [[2411.17643]] A robust image encryption scheme based on new 4-D hyperchaotic system and elliptic curve(https://arxiv.org/abs/2411.17643)
Keywords: diffusion
Abstract: In this work, a new 4-D hyperchaotic system for image encryption is proposed and its effectiveness is demonstrated by incorporating it into an existing Elliptic Curve Cryptography (ECC) mapping scheme. The proposed system is considered simple because it consists of eight terms with two nonlinearities. The system exhibits high sensitivity to initial conditions, which makes it suitable for encryption purposes. The two-stage encryption process, involving confusion and diffusion, is employed to protect the confidentiality of digital images. The simulation results demonstrate the effectiveness of the hyperchaotic system in terms of security and performance when combined with the ECC mapping scheme. This approach can be applied in various domains including healthcare, military, and entertainment to ensure the robust encryption of digital images.

Title: How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations

Authors: Hyunji Lee, Danni Liu, Supriti Sinhamahapatra, Jan Niehues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17666
Pdf URL: https://arxiv.org/pdf/2411.17666
Copy Paste: [[2411.17666]] How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations(https://arxiv.org/abs/2411.17666)
Keywords: foundation model
Abstract: Multimodal foundation models aim to create a unified representation space that abstracts away from surface features like language syntax or modality differences. To investigate this, we study the internal representations of three recent models, analyzing the model activations from semantically equivalent sentences across languages in the text and speech modalities. Our findings reveal that: 1) Cross-modal representations converge over model layers, except in the initial layers specialized at text and speech processing. 2) Length adaptation is crucial for reducing the cross-modal gap between text and speech, although current approaches' effectiveness is primarily limited to high-resource languages. 3) Speech exhibits larger cross-lingual differences than text. 4) For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.

Title: SketchAgent: Language-Driven Sequential Sketch Generation

Authors: Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17673
Pdf URL: https://arxiv.org/pdf/2411.17673
Copy Paste: [[2411.17673]] SketchAgent: Language-Driven Sequential Sketch Generation(https://arxiv.org/abs/2411.17673)
Keywords: in-context
Abstract: Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.

Title: GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration

Authors: Sudarshan Rajagopalan, Nithin Gopalakrishnan Nair, Jay N. Paranjape, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17687
Pdf URL: https://arxiv.org/pdf/2411.17687
Copy Paste: [[2411.17687]] GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration(https://arxiv.org/abs/2411.17687)
Keywords: diffusion, generative
Abstract: Deep learning-based models for All-In-One Image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to those trained solely on existing datasets. Furthermore, we provide comprehensive analyses on the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.

Title: ScribbleLight: Single Image Indoor Relighting with Scribbles

Authors: Jun Myeong Choi, Annie Wang, Pieter Peers, Anand Bhattad, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17696
Pdf URL: https://arxiv.org/pdf/2411.17696
Copy Paste: [[2411.17696]] ScribbleLight: Single Image Indoor Relighting with Scribbles(https://arxiv.org/abs/2411.17696)
Keywords: diffusion, generative
Abstract: Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (e.g., turning lights on/off, adding highlights, cast shadows, or indirect lighting from unseen lights) from sparse scribble annotations.

Title: StableAnimator: High-Quality Identity-Preserving Human Image Animation

Authors: Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17697
Pdf URL: https://arxiv.org/pdf/2411.17697
Copy Paste: [[2411.17697]] StableAnimator: High-Quality Identity-Preserving Human Image Animation(https://arxiv.org/abs/2411.17697)
Keywords: diffusion
Abstract: Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively and face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.