2025-03-26

Title: Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks

Authors: Liang Zhang, Jionghao Lin, John Sabatini, Diego Zapata-Rivera, Carol Forsyth, Yang Jiang, John Hollander, Xiangen Hu, Arthur C. Graesser
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18982
Pdf URL: https://arxiv.org/pdf/2503.18982
Copy Paste: [[2503.18982]] Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks(https://arxiv.org/abs/2503.18982)
Keywords: generative
Abstract: Learner performance data collected by Intelligent Tutoring Systems (ITSs), such as responses to questions, is essential for modeling and predicting learners' knowledge states. However, missing responses due to skips or incomplete attempts create data sparsity, challenging accurate assessment and personalized instruction. To address this, we propose a generative imputation approach using Generative Adversarial Imputation Networks (GAIN). Our method features a three-dimensional (3D) framework (learners, questions, and attempts), flexibly accommodating various sparsity levels. Enhanced by convolutional neural networks and optimized with a least squares loss function, the GAIN-based method aligns input and output dimensions to question-attempt matrices along the learners' dimension. Extensive experiments using datasets from AutoTutor Adult Reading Comprehension (ARC), ASSISTments, and MATHia demonstrate that our approach significantly outperforms tensor factorization and alternative GAN methods in imputation accuracy across different attempt scenarios. Bayesian Knowledge Tracing (BKT) further validates the effectiveness of the imputed data by estimating learning parameters: initial knowledge (P(L0)), learning rate (P(T)), guess rate (P(G)), and slip rate (P(S)). Results indicate the imputed data enhances model fit and closely mirrors original distributions, capturing underlying learning behaviors reliably. Kullback-Leibler (KL) divergence assessments confirm minimal divergence, showing the imputed data preserves essential learning characteristics effectively. These findings underscore GAIN's capability as a robust imputation tool in ITSs, alleviating data sparsity and supporting adaptive, individualized instruction, ultimately leading to more precise and responsive learner assessments and improved educational outcomes.
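
The four BKT parameters named in the abstract above can be made concrete with the standard Bayesian Knowledge Tracing update. This is a minimal sketch of the textbook forward update, not the paper's fitting procedure; the function names `bkt_update` and `bkt_trace` are hypothetical and the parameter values in the usage note are made up for illustration.

```python
def bkt_update(p_known, correct, p_transit, p_guess, p_slip):
    """One standard BKT step: condition on the response, then apply learning.

    p_known   -- current P(L), probability the skill is already mastered
    correct   -- True if the learner answered correctly
    p_transit -- P(T), probability of learning the skill after an attempt
    p_guess   -- P(G), probability of a correct answer without mastery
    p_slip    -- P(S), probability of an incorrect answer despite mastery
    """
    if correct:
        numerator = p_known * (1.0 - p_slip)
        denominator = numerator + (1.0 - p_known) * p_guess
    else:
        numerator = p_known * p_slip
        denominator = numerator + (1.0 - p_known) * (1.0 - p_guess)
    posterior = numerator / denominator  # P(L | observed response)
    return posterior + (1.0 - posterior) * p_transit


def bkt_trace(responses, p_init, p_transit, p_guess, p_slip):
    """Run the update over a sequence of attempts, starting from P(L0)."""
    p_known = p_init
    trace = []
    for correct in responses:
        p_known = bkt_update(p_known, correct, p_transit, p_guess, p_slip)
        trace.append(p_known)
    return trace
```

For example, `bkt_trace([True, False, True], 0.5, 0.1, 0.2, 0.1)` returns the mastery estimate after each of three attempts. Missing attempts have no defined update here, which is one way to see why sparse response data complicates this kind of model.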

Title: DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

Authors: Kangwei Liu, Junwu Liu, Yun Cao, Jinlin Guo, Xiaowei Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19001
Pdf URL: https://arxiv.org/pdf/2503.19001
Copy Paste: [[2503.19001]] DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model(https://arxiv.org/abs/2503.19001)
Keywords: diffusion
Abstract: Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: this https URL.

Title: RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

Authors: Yifei Feng, Mingxin Yang, Shuhui Yang, Sheng Zhang, Jiaao Yu, Zibo Zhao, Yuhong Liu, Jie Jiang, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19011
Pdf URL: https://arxiv.org/pdf/2503.19011
Copy Paste: [[2503.19011]] RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis(https://arxiv.org/abs/2503.19011)
Keywords: diffusion
Abstract: Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.

Title: DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding

Authors: Lingyan Ran, Lidong Wang, Guangcong Wang, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19012
Pdf URL: https://arxiv.org/pdf/2503.19012
Copy Paste: [[2503.19012]] DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding(https://arxiv.org/abs/2503.19012)
Keywords: diffusion
Abstract: The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to achieve the infrared transition from full-range to target wavelength. To improve V2IR translation, VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled from various scenes and objects under various environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR's excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at this https URL.

Title: Color Conditional Generation with Sliced Wasserstein Guidance

Authors: Alexander Lobashev, Maria Larchenko, Dmitry Guskov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19034
Pdf URL: https://arxiv.org/pdf/2503.19034
Copy Paste: [[2503.19034]] Color Conditional Generation with Sliced Wasserstein Guidance(https://arxiv.org/abs/2503.19034)
Keywords: diffusion
Abstract: We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at this https URL.
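
The sliced 1-Wasserstein distance mentioned in the abstract above can be estimated by projecting both color sets onto random directions and comparing the sorted projections. The following is a minimal sketch of that estimator only; it is not differentiable and does not implement the paper's guidance of the diffusion sampling process. The function name `sliced_wasserstein_1`, the equal-size requirement, and the default of 64 projections are assumptions made for illustration.

```python
import math
import random


def sliced_wasserstein_1(colors_a, colors_b, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    colors_a, colors_b -- equal-length lists of color tuples, e.g. (r, g, b)
    For equal-size 1D samples, W1 is the mean absolute difference between
    the sorted values, so each random projection reduces to a sort.
    """
    if len(colors_a) != len(colors_b):
        raise ValueError("this sketch assumes equally sized color sets")
    rng = random.Random(seed)
    dim = len(colors_a[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction in color space.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project every color onto the direction and sort the results.
        proj_a = sorted(sum(c * d for c, d in zip(p, direction)) for p in colors_a)
        proj_b = sorted(sum(c * d for c, d in zip(p, direction)) for p in colors_b)
        total += sum(abs(a - b) for a, b in zip(proj_a, proj_b)) / len(proj_a)
    return total / n_projections
```

Because only sorted projections are compared, the estimate ignores pixel order: two images with the same palette in different positions have distance zero, which matches the idea of conditioning on a color distribution rather than on a layout.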

Title: HingeRLC-GAN: Combating Mode Collapse with Hinge Loss and RLC Regularization

Authors: Osman Goni, Himadri Saha Arka, Mithun Halder, Mir Moynuddin Ahmed Shibly, Swakkhar Shatabda
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19074
Pdf URL: https://arxiv.org/pdf/2503.19074
Copy Paste: [[2503.19074]] HingeRLC-GAN: Combating Mode Collapse with Hinge Loss and RLC Regularization(https://arxiv.org/abs/2503.19074)
Keywords: generative
Abstract: Recent advances in Generative Adversarial Networks (GANs) have demonstrated their capability for producing high-quality images. However, a significant challenge remains mode collapse, which occurs when the generator produces a limited number of data patterns that do not reflect the diversity of the training dataset. This study addresses this issue by proposing a number of architectural changes aimed at increasing the diversity and stability of GAN models. We start by improving the loss function with Wasserstein loss and Gradient Penalty to better capture the full range of data variations. We also investigate various network architectures and conclude that ResNet significantly contributes to increased diversity. Building on these findings, we introduce HingeRLC-GAN, a novel approach that combines RLC Regularization and the Hinge loss function. With a FID Score of 18 and a KID Score of 0.001, our approach outperforms existing methods by effectively balancing training stability and increased diversity.
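
For readers unfamiliar with the hinge loss named in the abstract above, here is a minimal sketch of the hinge formulation commonly used for GANs. The abstract does not give the paper's exact loss or define RLC regularization, so this shows only the widely used hinge terms; the function names are hypothetical and the inputs are raw (unbounded) discriminator scores.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Common GAN hinge loss for the discriminator.

    Real samples are pushed to score at least +1 and fake samples at most -1;
    samples already beyond the margin contribute nothing.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    """Common generator objective paired with the hinge discriminator loss:
    raise the discriminator's score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)
```

For instance, with real scores `[2.0, 0.5]` and fake scores `[-2.0, 0.0]`, only the second sample of each pair is inside the margin and contributes to the discriminator loss.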

Title: Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training

Authors: Amin Totounferoush, Serge Kotchourko, Michael W. Mahoney, Steffen Staab
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19081
Pdf URL: https://arxiv.org/pdf/2503.19081
Copy Paste: [[2503.19081]] Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training(https://arxiv.org/abs/2503.19081)
Keywords: diffusion, foundation model
Abstract: Partial differential equations (PDEs) govern a wide range of physical systems, but solving them efficiently remains a major challenge. The idea of a scientific foundation model (SciFM) is emerging as a promising tool for learning transferable representations across diverse domains. However, SciFMs require large amounts of solution data, which may be scarce or computationally expensive to generate. To maximize generalization while reducing data dependence, we propose incorporating PDE residuals into pre-training either as the sole learning signal or in combination with data loss to compensate for limited or infeasible training data. We evaluate this constraint-aware pre-training across three key benchmarks: (i) generalization to new physics, where material properties, e.g., the diffusion coefficient, are shifted with respect to the training distribution; (ii) generalization to entirely new PDEs, requiring adaptation to different operators; and (iii) robustness against noisy fine-tuning data, ensuring stability in real-world applications. Our results show that pre-training with PDE constraints significantly enhances generalization, outperforming models trained solely on solution data across all benchmarks. These findings prove the effectiveness of our proposed constraint-aware pre-training as a crucial component for SciFMs, providing a scalable approach to data-efficient, generalizable PDE solvers.
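
To illustrate what a PDE residual used as a learning signal can look like, the sketch below evaluates a finite-difference residual of the 1D diffusion equation u_t = D u_xx on a space-time grid and combines it with an optional data loss. This is an assumption-laden stand-in: the paper trains neural models, and the abstract does not specify its PDEs, discretization, or loss weighting. The names `diffusion_residual_mse`, `pretraining_loss`, and `residual_weight` are hypothetical.

```python
def diffusion_residual_mse(u, dx, dt, diff_coeff):
    """Mean squared residual of u_t - D * u_xx at interior grid points.

    u[n][i] is the predicted solution at time step n and spatial point i.
    Uses a forward difference in time and a central difference in space.
    """
    total, count = 0.0, 0
    for n in range(len(u) - 1):
        for i in range(1, len(u[n]) - 1):
            u_t = (u[n + 1][i] - u[n][i]) / dt
            u_xx = (u[n][i + 1] - 2.0 * u[n][i] + u[n][i - 1]) / dx ** 2
            total += (u_t - diff_coeff * u_xx) ** 2
            count += 1
    return total / count


def pretraining_loss(u_pred, u_data, dx, dt, diff_coeff, residual_weight=1.0):
    """Residual-only loss when u_data is None, otherwise data loss plus residual."""
    residual = residual_weight * diffusion_residual_mse(u_pred, dx, dt, diff_coeff)
    if u_data is None:
        return residual
    n_values = sum(len(row) for row in u_pred)
    data = sum(
        (p - d) ** 2
        for row_p, row_d in zip(u_pred, u_data)
        for p, d in zip(row_p, row_d)
    ) / n_values
    return data + residual
```

The residual term needs no solution data at all: it is zero whenever the prediction satisfies the discretized equation, which is why it can serve as the sole signal when data is unavailable.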

Title: Anomaly Detection Using Computer Vision: A Comparative Analysis of Class Distinction and Performance Metrics

Authors: Md. Barkat Ullah Tusher, Shartaz Khan Akash, Amirul Islam Showmik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19100
Pdf URL: https://arxiv.org/pdf/2503.19100
Copy Paste: [[2503.19100]] Anomaly Detection Using Computer Vision: A Comparative Analysis of Class Distinction and Performance Metrics(https://arxiv.org/abs/2503.19100)
Keywords: anomaly
Abstract: This paper showcases an experimental study on anomaly detection using computer vision. The study focuses on class distinction and performance evaluation, combining OpenCV with deep learning techniques while employing a TensorFlow-based convolutional neural network for real-time face recognition and classification. The system effectively distinguishes among three classes: authorized personnel (admin), intruders, and non-human entities. A MobileNetV2-based deep learning model is utilized to optimize real-time performance, ensuring high computational efficiency without compromising accuracy. Extensive dataset preprocessing, including image augmentation and normalization, enhances the model's generalization capabilities. Our analysis demonstrates classification accuracies of 90.20% for admin, 98.60% for intruders, and 75.80% for non-human detection, while maintaining an average processing rate of 30 frames per second. The study leverages transfer learning, batch normalization, and Adam optimization to achieve stable and robust learning, and a comparative analysis of class differentiation strategies highlights the impact of feature extraction techniques and training methodologies. The results indicate that advanced feature selection and data augmentation significantly enhance detection performance, particularly in distinguishing human from non-human scenes. As an experimental study, this research provides critical insights into optimizing deep learning-based surveillance systems for high-security environments and improving the accuracy and efficiency of real-time anomaly detection.

Title: MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

Authors: Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2503.19134
Pdf URL: https://arxiv.org/pdf/2503.19134
Copy Paste: [[2503.19134]] MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks(https://arxiv.org/abs/2503.19134)
Keywords: diffusion
Abstract: While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.

Title: Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants

Authors: Yorick Estievenart, Sukanya Patra, Souhaib Ben Taieb
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19146
Pdf URL: https://arxiv.org/pdf/2503.19146
Copy Paste: [[2503.19146]] Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants(https://arxiv.org/abs/2503.19146)
Keywords: generative, anomaly
Abstract: Efficient and reliable operation of Concentrated Solar Power (CSP) plants is essential for meeting the growing demand for sustainable energy. However, high-temperature solar receivers face severe operational risks, such as freezing, deformation, and corrosion, resulting in costly downtime and maintenance. To monitor CSP plants, cameras mounted on solar receivers record infrared images at irregular intervals ranging from one to five minutes throughout the day. Anomalous images can be detected by thresholding an anomaly score, where the threshold is chosen to optimize metrics such as the F1-score on a validation set. First, this work proposes a framework for generating more reliable decision thresholds with finite-sample coverage guarantees on any chosen risk function. Our framework also incorporates an abstention mechanism, allowing high-risk predictions to be deferred to domain experts. Second, we propose a density forecasting method to estimate the likelihood of an observed image given a sequence of previously observed images, using this likelihood as its anomaly score. Third, we analyze the deployment results of our framework across multiple training scenarios over several months for two CSP plants. This analysis provides valuable insights to our industry partner for optimizing maintenance operations. Finally, given the confidential nature of our dataset, we provide an extended simulated dataset, leveraging recent advancements in generative modeling to create diverse thermal images that simulate multiple CSP plants. Our code is publicly available.
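
The conventional baseline described in the abstract above, choosing the anomaly-score threshold that maximizes F1 on a validation set, can be sketched as follows. This illustrates only that baseline, not the paper's risk-controlled thresholds or abstention mechanism; the function names and the convention that a score at or above the threshold is flagged as anomalous are assumptions.

```python
def f1_score(labels, predictions):
    """F1 for binary labels, with 1/True meaning 'anomalous'."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    return 2.0 * tp / (2.0 * tp + fp + fn)


def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best F1.

    Ties are resolved in favor of the lowest threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        predictions = [s >= threshold for s in scores]
        f1 = f1_score(labels, predictions)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

A threshold picked this way is tuned to one validation sample and carries no guarantee on new data, which is the gap that finite-sample risk guarantees are meant to close.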

Title: HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Authors: Mingzhen Huang, Fu-Jen Chu, Bugra Tekin, Kevin J Liang, Haoyu Ma, Weiyao Wang, Xingyu Chen, Pierre Gleize, Hongfei Xue, Siwei Lyu, Kris Kitani, Matt Feiszli, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19157
Pdf URL: https://arxiv.org/pdf/2503.19157
Copy Paste: [[2503.19157]] HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models(https://arxiv.org/abs/2503.19157)
Keywords: generative
Abstract: We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interactions (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.

Title: SoK: How Robust is Audio Watermarking in Generative AI models?

Authors: Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, Qiben Yan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19176
Pdf URL: https://arxiv.org/pdf/2503.19176
Copy Paste: [[2503.19176]] SoK: How Robust is Audio Watermarking in Generative AI models?(https://arxiv.org/abs/2503.19176)
Keywords: generative
Abstract: Audio watermarking is increasingly used to verify the provenance of AI-generated content, enabling applications such as detecting AI-generated speech, protecting music IP, and defending against voice cloning. To be effective, audio watermarks must resist removal attacks that distort signals to evade detection. While many schemes claim robustness, these claims are typically tested in isolation and against a limited set of attacks. A systematic evaluation against diverse removal attacks is lacking, hindering practical deployment. In this paper, we investigate whether recent watermarking schemes that claim robustness can withstand a broad range of removal attacks. First, we introduce a taxonomy covering 22 audio watermarking schemes. Next, we summarize their underlying technologies and potential vulnerabilities. We then present a large-scale empirical study to assess their robustness. To support this, we build an evaluation framework encompassing 22 types of removal attacks (109 configurations) including signal-level, physical-level, and AI-induced distortions. We reproduce 9 watermarking schemes using open-source code, identify 8 new highly effective attacks, and highlight 11 key findings that expose the fundamental limitations of these methods across 3 public datasets. Our results reveal that none of the surveyed schemes can withstand all tested distortions. This evaluation offers a comprehensive view of how current watermarking methods perform under real-world threats. Our demo and code are available at this https URL.

Title: FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

Authors: Yufan Ren, Zicong Jiang, Tong Zhang, Søren Forchhammer, Sabine Süsstrunk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19191
Pdf URL: https://arxiv.org/pdf/2503.19191
Copy Paste: [[2503.19191]] FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing(https://arxiv.org/abs/2503.19191)
Keywords: diffusion
Abstract: Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications, such as the loss of local detail and color changes. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables the selective optimization of specific frequency bands within localized spatial regions for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications at various levels of detail. To extend the applicability of our approach, we provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, enabling frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits.
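
The idea of editing one frequency band while leaving the others untouched can be illustrated with a one-level Haar wavelet transform on a 1D signal. This is a toy sketch: the abstract does not say which wavelet family the authors use, and their method operates on images and triplanes inside a diffusion-based editing loop, none of which is reproduced here. The function names are hypothetical.

```python
import math


def haar_decompose(signal):
    """One-level orthonormal Haar transform of an even-length signal.

    Returns (low, high): pairwise averages (coarse band) and pairwise
    differences (detail band), each half the original length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose."""
    scale = math.sqrt(2.0)
    signal = []
    for a, d in zip(low, high):
        signal.extend([(a + d) / scale, (a - d) / scale])
    return signal
```

Scaling only `low` before calling `haar_reconstruct` changes the coarse content while the detail band of the result stays as it was, which is the kind of band-selective adjustment the abstract describes.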

Title: Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Authors: Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, Francis Engelmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19199
Pdf URL: https://arxiv.org/pdf/2503.19199
Copy Paste: [[2503.19199]] Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces(https://arxiv.org/abs/2503.19199)
Keywords: foundation model
Abstract: We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs. See our project page at this https URL

Title: Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing

Authors: Ruiyi Wang, Yushuo Zheng, Zicheng Zhang, Chunyi Li, Shuaicheng Liu, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19262
Pdf URL: https://arxiv.org/pdf/2503.19262
Copy Paste: [[2503.19262]] Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing(https://arxiv.org/abs/2503.19262)
Keywords: diffusion, generative
Abstract: Existing real-world image dehazing methods primarily attempt to fine-tune pre-trained models or adapt their inference procedures, thus heavily relying on the pre-trained models and associated training data. Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). Specifically, HazeGen harnesses robust generative diffusion priors of real-world hazy images embedded in a pre-trained text-to-image diffusion model. By employing specialized hybrid training and blended sampling strategies, HazeGen produces realistic and diverse hazy images as high-quality training data for DiffDehaze. To alleviate the inefficiency and fidelity concerns associated with diffusion-based methods, DiffDehaze adopts an Accelerated Fidelity-Preserving Sampling process (AccSamp). The core of AccSamp is the Tiled Statistical Alignment Operation (AlignOp), which can provide a clean and faithful dehazing estimate within a small fraction of sampling steps to reduce complexity and enable effective fidelity guidance. Extensive experiments demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The code is available at this https URL.

Title: ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency

Authors: Yang Ren, Hai Jiang, Menglong Yang, Wei Li, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19283
Pdf URL: https://arxiv.org/pdf/2503.19283
Copy Paste: [[2503.19283]] ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency(https://arxiv.org/abs/2503.19283)
Keywords: diffusion, generative
Abstract: RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually. The code is available at this https URL.

Title: Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment

Authors: Guanglu Dong, Xiangyu Liao, Mingyang Li, Guihuan Guo, Chao Ren
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19295
Pdf URL: https://arxiv.org/pdf/2503.19295
Copy Paste: [[2503.19295]] Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment(https://arxiv.org/abs/2503.19295)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have been widely applied to image super-resolution (SR) to enhance the perceptual quality. However, most existing GAN-based SR methods typically perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super-resolution networks (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D) to discriminate the pixel-wise middle semantic features from CLIP, aligning the feature distributions of SR images with those of high-quality images. Additionally, we propose a text-guided discrimination method (TG-D) by introducing learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output feature of CLIP, further enhancing the discriminative ability of our method. With both Feat-D and TG-D, our SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging SRN to generate more realistic and semantic-relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, greatly improving OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world SISR, and OU NR-IQA tasks demonstrate the effectiveness of our proposed methods.

Title: UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design

Authors: Xiangzhe Kong, Zishen Zhang, Ziting Zhang, Rui Jiao, Jianzhu Ma, Kai Liu, Wenbing Huang, Yang Liu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.19300
Pdf URL: https://arxiv.org/pdf/2503.19300
Copy Paste: [[2503.19300]] UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design(https://arxiv.org/abs/2503.19300)
Keywords: diffusion, generative
Abstract: The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.

Title: LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

Authors: Weizhi Chen, Jingbo Chen, Yupeng Deng, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19311
Pdf URL: https://arxiv.org/pdf/2503.19311
Copy Paste: [[2503.19311]] LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text(https://arxiv.org/abs/2503.19311)
Keywords: foundation model
Abstract: This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) We design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy=75.75%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open-sourced and is available at this https URL.

Title: ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Authors: Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, Lijuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19312
Pdf URL: https://arxiv.org/pdf/2503.19312
Copy Paste: [[2503.19312]] ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning(https://arxiv.org/abs/2503.19312)
Keywords: in-context
Abstract: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at this https URL. Code and model weights will be open-sourced.

Title: Efficient Adversarial Detection Frameworks for Vehicle-to-Microgrid Services in Edge Computing

Authors: Ahmed Omara, Burak Kantarci
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19318
Pdf URL: https://arxiv.org/pdf/2503.19318
Copy Paste: [[2503.19318]] Efficient Adversarial Detection Frameworks for Vehicle-to-Microgrid Services in Edge Computing(https://arxiv.org/abs/2503.19318)
Keywords: generative
Abstract: As Artificial Intelligence (AI) becomes increasingly integrated into microgrid control systems, the risk of malicious actors exploiting vulnerabilities in Machine Learning (ML) algorithms to disrupt power generation and distribution grows. Detection models to identify adversarial attacks need to meet the constraints of edge environments, where computational power and memory are often limited. To address this issue, we propose a novel strategy that optimizes detection models for Vehicle-to-Microgrid (V2M) edge environments without compromising performance against inference and evasion attacks. Our approach integrates model design and compression into a unified process and results in a highly compact detection model that maintains high accuracy. We evaluated our method against four benchmark evasion attacks (Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Carlini & Wagner method (C&W), and Conditional Generative Adversarial Network (CGAN) method) and two knowledge-based attacks, white-box and gray-box. Our optimized model reduces memory usage from 20MB to 1.3MB, inference time from 3.2 seconds to 0.9 seconds, and GPU utilization from 5% to 2.68%.
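
To put the three before/after figures quoted in the abstract above on a common scale, the following minimal sketch converts each pair into a relative reduction. The helper name `percent_reduction` and the dictionary layout are illustrative assumptions; only the six numbers come from the abstract.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (before, after) pairs as quoted in the abstract; the keys are illustrative labels.
reported = {
    "memory_mb": (20.0, 1.3),
    "inference_s": (3.2, 0.9),
    "gpu_util_pct": (5.0, 2.68),
}

reductions = {name: percent_reduction(b, a) for name, (b, a) in reported.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower")
```

Under these figures, memory usage falls by roughly 93.5%, inference time by roughly 71.9%, and GPU utilization by roughly 46.4% in relative terms.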

Title: Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19325
Pdf URL: https://arxiv.org/pdf/2503.19325
Copy Paste: [[2503.19325]] Long-Context Autoregressive Video Modeling with Next-Frame Prediction(https://arxiv.org/abs/2503.19325)
Keywords: diffusion
Abstract: Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.

Title: BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction

Authors: Yuguang Li, Ivaylo Boyadzhiev, Zixuan Liu, Linda Shapiro, Alex Colburn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19340
Pdf URL: https://arxiv.org/pdf/2503.19340
Copy Paste: [[2503.19340]] BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction(https://arxiv.org/abs/2503.19340)
Keywords: diffusion
Abstract: Reconstructing precise camera poses and floor plan layouts from wide-baseline RGB panoramas is a difficult and unsolved problem. We introduce BADGR, a novel diffusion model that jointly performs reconstruction and bundle adjustment (BA) to refine poses and layouts from a coarse state, using 1D floor boundary predictions from dozens of images of varying input densities. Unlike a guided diffusion model, BADGR is conditioned on dense per-entity outputs from a single-step Levenberg-Marquardt (LM) optimizer and is trained to predict camera and wall positions while minimizing reprojection errors for view-consistency. The layout generation objective of the denoising diffusion process complements BA optimization by providing additional learned layout-structural constraints on top of the co-visible features across images. These constraints help BADGR make plausible guesses about spatial relations that help constrain the pose graph, such as wall adjacency and collinearity, and learn to mitigate errors from dense boundary observations using global context. BADGR trains exclusively on 2D floor plans, simplifying data acquisition, enabling robust augmentation, and supporting a variety of input densities. Our experiments and analysis validate our method, which significantly outperforms the state-of-the-art pose and floor plan layout reconstruction with different input densities.

Title: Data-driven Mesoscale Weather Forecasting Combining Swin-Unet and Diffusion Models

Authors: Yuta Hirabayashi, Daisuke Matsuoka
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2503.19354
Pdf URL: https://arxiv.org/pdf/2503.19354
Copy Paste: [[2503.19354]] Data-driven Mesoscale Weather Forecasting Combining Swin-Unet and Diffusion Models(https://arxiv.org/abs/2503.19354)
Keywords: diffusion
Abstract: Data-driven weather prediction models exhibit promising performance and advance continuously. In particular, diffusion models represent fine-scale details without spatial smoothing, which is crucial for mesoscale predictions, such as heavy rainfall forecasting. However, the applications of diffusion models to mesoscale prediction remain limited. To address this gap, this study proposes an architecture that combines a diffusion model with Swin-Unet as a deterministic model, achieving mesoscale predictions while maintaining flexibility. The proposed architecture trains the two models independently, allowing the diffusion model to remain unchanged when the deterministic model is updated. Comparisons using the Fractions Skill Score and power spectral analysis demonstrate that incorporating the diffusion model leads to improved accuracy compared to predictions without it. These findings underscore the potential of the proposed architecture to enhance mesoscale predictions, particularly for strong rainfall events, while maintaining flexibility.
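
For readers unfamiliar with the Fractions Skill Score named in the abstract above, the following is a minimal pure-Python sketch of the standard definition: both fields are thresholded, converted to neighborhood exceedance fractions, and compared as FSS = 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2)). It is not the authors' evaluation code. The function names, the square window, and the treatment of cells outside the grid as non-exceeding are assumptions made for illustration.

```python
def neighborhood_fractions(field, threshold, half_width):
    """Fraction of cells >= threshold in a square window around each grid point.

    Assumption: cells outside the grid count as not exceeding the threshold.
    """
    rows, cols = len(field), len(field[0])
    binary = [[1.0 if v >= threshold else 0.0 for v in row] for row in field]
    window = (2 * half_width + 1) ** 2
    fractions = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            total = 0.0
            for di in range(-half_width, half_width + 1):
                for dj in range(-half_width, half_width + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        total += binary[ii][jj]
            out_row.append(total / window)
        fractions.append(out_row)
    return fractions


def fractions_skill_score(forecast, observed, threshold, half_width):
    """FSS = 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2))."""
    pf = neighborhood_fractions(forecast, threshold, half_width)
    po = neighborhood_fractions(observed, threshold, half_width)
    num = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    den = sum(f * f for row in pf for f in row) + sum(o * o for row in po for o in row)
    if den == 0.0:
        return float("nan")  # neither field exceeds the threshold anywhere
    return 1.0 - num / den


def single_cell_grid(size, row, col, value=10.0):
    """Hypothetical helper: a size x size grid of zeros with one wet cell."""
    grid = [[0.0] * size for _ in range(size)]
    grid[row][col] = value
    return grid


observed = single_cell_grid(5, 2, 2)
forecast = single_cell_grid(5, 2, 3)  # same event, displaced by one cell

fss_point = fractions_skill_score(forecast, observed, threshold=1.0, half_width=0)
fss_neighborhood = fractions_skill_score(forecast, observed, threshold=1.0, half_width=1)
print(fss_point, fss_neighborhood)
```

In this toy case a one-cell displacement scores zero at grid scale but receives partial credit once a 3x3 neighborhood is used, which is why the metric is commonly applied to localized events such as heavy rainfall.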

Title: Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection

Authors: Farzad Beizaee, Gregory A. Lodygensky, Christian Desrosiers, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19357
Pdf URL: https://arxiv.org/pdf/2503.19357
Copy Paste: [[2503.19357]] Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection(https://arxiv.org/abs/2503.19357)
Keywords: diffusion, anomaly
Abstract: Recent advances in diffusion models have spurred research into their application for reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions of an image while preserving normal ones. This leads to potential degradation of normal regions during reconstruction, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. By modeling anomalies as noise in the latent space, our proposed Deviation correction diffusion model preserves the normal regions and encourages transformations exclusively on anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well-known anomaly detection datasets. The code is available at this https URL

Title: Show and Segment: Universal Medical Image Segmentation via In-Context Learning

Authors: Yunhe Gao, Di Liu, Zhuowei Li, Yunsheng Li, Dongdong Chen, Mu Zhou, Dimitris N. Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19359
Pdf URL: https://arxiv.org/pdf/2503.19359
Copy Paste: [[2503.19359]] Show and Segment: Universal Medical Image Segmentation via In-Context Learning(https://arxiv.org/abs/2503.19359)
Keywords: in-context
Abstract: Medical image segmentation remains challenging due to the vast diversity of anatomical structures, imaging modalities, and segmentation tasks. While deep learning has made significant advances, current approaches struggle to generalize as they require task-specific training or fine-tuning on unseen classes. We present Iris, a novel In-context Reference Image guided Segmentation framework that enables flexible adaptation to novel tasks through the use of reference examples without fine-tuning. At its core, Iris features a lightweight context task encoding module that distills task-specific information from reference context image-label pairs. This rich context embedding information is used to guide the segmentation of target objects. By decoupling task encoding from inference, Iris supports diverse strategies from one-shot inference and context example ensemble to object-level context example retrieval and in-context tuning. Through comprehensive evaluation across twelve datasets, we demonstrate that Iris performs strongly compared to task-specific models on in-distribution tasks. On seven held-out datasets, Iris shows superior generalization to out-of-distribution data and unseen classes. Further, Iris's task encoding module can automatically discover anatomical relationships across datasets and modalities, offering insights into medical objects without explicit anatomical supervision.

Title: ImageSet2Text: Describing Sets of Images through Text

Authors: Piera Riccio, Francesco Galati, Kajetan Schweighofer, Noa Garcia, Nuria Oliver
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19361
Pdf URL: https://arxiv.org/pdf/2503.19361
Copy Paste: [[2503.19361]] ImageSet2Text: Describing Sets of Images through Text(https://arxiv.org/abs/2503.19361)
Keywords: foundation model
Abstract: We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual-question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text's descriptions on accuracy, completeness, readability and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.

Title: VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction

Authors: Zizhi Chen, Minghao Han, Xukun Zhang, Shuwei Ma, Tao Liu, Xing Wei, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19367
Pdf URL: https://arxiv.org/pdf/2503.19367
Copy Paste: [[2503.19367]] VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction(https://arxiv.org/abs/2503.19367)
Keywords: generative
Abstract: Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is this https URL.

Title: EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

Authors: Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19369
Pdf URL: https://arxiv.org/pdf/2503.19369
Copy Paste: [[2503.19369]] EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models(https://arxiv.org/abs/2503.19369)
Keywords: diffusion, generative
Abstract: Progress in generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explore the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategies, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at this https URL.

Title: DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image

Authors: Hyeongjin Nam, Donghwan Kim, Jeongtaek Oh, Kyoung Mu Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19373
Pdf URL: https://arxiv.org/pdf/2503.19373
Copy Paste: [[2503.19373]] DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image(https://arxiv.org/abs/2503.19373)
Keywords: diffusion
Abstract: Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. This task remains largely unexplored due to the extreme occlusion between cloth and the human body, making it challenging to infer accurate geometries and textures. Moreover, while recent 3D human reconstruction methods have achieved impressive results using text-to-image diffusion models, directly applying such an approach to this problem often leads to incorrect guidance, particularly in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of cloth and human body as regularizations, which provide strong geometric priors to prevent erroneous reconstruction by the occlusion. Second, we introduce a cloth diffusion model specifically designed to provide contextual information about cloth appearance, thereby enhancing the reconstruction of 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approach is highly effective in reconstructing both 3D cloth and the human body. More qualitative results are provided at this https URL.

Title: Interpretable Generative Models through Post-hoc Concept Bottlenecks

Authors: Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19377
Pdf URL: https://arxiv.org/pdf/2503.19377
Copy Paste: [[2503.19377]] Interpretable Generative Models through Post-hoc Concept Bottlenecks(https://arxiv.org/abs/2503.19377)
Keywords: diffusion, generative
Abstract: Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need of real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.

Title: Social Network User Profiling for Anomaly Detection Based on Graph Neural Networks

Authors: Yiwei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19380
Pdf URL: https://arxiv.org/pdf/2503.19380
Copy Paste: [[2503.19380]] Social Network User Profiling for Anomaly Detection Based on Graph Neural Networks(https://arxiv.org/abs/2503.19380)
Keywords: self-supervised, anomaly
Abstract: This study proposes a risk pricing anomaly detection method for social network user portraits based on graph neural networks (GNNs), aiming to improve the ability to identify abnormal users in social network environments. In view of the limitations of traditional methods in social network data modeling, this paper combines graph autoencoders (GAEs) and graph attention networks (GATs) to achieve accurate detection of abnormal users through dynamic aggregation of neighbor features and reconstruction error evaluation. The Facebook Page-Page Network dataset is used in the experiment and compared with VAE, GNN, Transformer and GAE. The results show that the proposed method achieves the best performance in AUC, F1-score, Precision and Recall, verifying its effectiveness. In addition, this paper explores the computational efficiency of the model in large-scale data and looks forward to combining self-supervised learning, federated learning, and other technologies in the future to improve the robustness and privacy protection of risk assessment. The research results can provide efficient anomaly detection solutions for financial risk control, social security management, and other fields.

Title: MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation

Authors: Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela S.M. Lau, Guosheng Yin, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19383
Pdf URL: https://arxiv.org/pdf/2503.19383
Copy Paste: [[2503.19383]] MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation(https://arxiv.org/abs/2503.19383)
Keywords: diffusion
Abstract: Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. We present a novel two-stage text-guided framework, MVPortrait (Multi-view Vivid Portrait), to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results exhibit that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.

Title: Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

Authors: Jaihoon Kim, Taehoon Yoon, Jisung Hwang, Minhyuk Sung
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19385
Pdf URL: https://arxiv.org/pdf/2503.19385
Copy Paste: [[2503.19385]] Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing(https://arxiv.org/abs/2503.19385)
Keywords: diffusion, generative
Abstract: We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. In contrast, while flow models have gained popularity as an alternative to diffusion models, offering faster generation and high-quality outputs in state-of-the-art image and video generative models, the efficient inference-time scaling methods used for diffusion models cannot be directly applied to them due to their deterministic generative process. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) Interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.

Title: Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models

Authors: Masaya Hasegawa, Koji Yasuda
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19429
Pdf URL: https://arxiv.org/pdf/2503.19429
Copy Paste: [[2503.19429]] Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models(https://arxiv.org/abs/2503.19429)
Keywords: diffusion
Abstract: Diffusion models, which have been advancing rapidly in recent years, may generate samples that closely resemble the training data. This phenomenon, known as memorization, may lead to copyright issues. In this study, we propose a method to quantify the ease of reproducing training data in unconditional diffusion models. The average of a sample population following the Langevin equation in the reverse diffusion process moves according to a first-order ordinary differential equation (ODE). This ODE establishes a 1-to-1 correspondence between images and their noisy counterparts in the latent space. Since the ODE is reversible and the initial noisy images are sampled randomly, the volume of an image's projected area represents the probability of generating those images. We examined the ODE, which projects images to latent space, and succeeded in quantifying the ease of reproducing training data by measuring the volume growth rate in this process. Given the relatively low computational complexity of this method, it allows us to enhance the quality of training data by detecting and modifying the easily memorized training samples.

Title: Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model

Authors: Changyong He, Jin Zeng, Jiawei Zhang, Jiajie Guo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19448
Pdf URL: https://arxiv.org/pdf/2503.19448
Copy Paste: [[2503.19448]] Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model(https://arxiv.org/abs/2503.19448)
Keywords: diffusion
Abstract: Time-of-Flight (ToF) sensors efficiently capture scene depth, but the nonlinear depth construction procedure often results in extremely large noise variance or even invalid areas. Recent methods based on deep neural networks (DNNs) achieve enhanced ToF denoising accuracy but tend to struggle when presented with severe noise corruption due to limited prior knowledge of ToF data distribution. In this paper, we propose DepthCAD, a novel ToF denoising approach that ensures global structural smoothness by leveraging the rich prior knowledge in Stable Diffusion and maintains local metric accuracy by steering the diffusion process with confidence guidance. To adapt the pretrained image diffusion model to ToF depth denoising, we apply the diffusion on raw ToF correlation measurements with dynamic range normalization before converting to depth maps. Experimental results validate the state-of-the-art performance of the proposed scheme, and the evaluation on real data further verifies its robustness against real-world ToF noise.

Title: SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors

Authors: Yiqing Li, Xuan Wang, Jiawei Wu, Yikun Ma, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19452
Pdf URL: https://arxiv.org/pdf/2503.19452
Copy Paste: [[2503.19452]] SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors(https://arxiv.org/abs/2503.19452)
Keywords: diffusion, generative
Abstract: Synthesizing novel views of large-scale scenes from unconstrained in-the-wild images is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively under sparse input conditions, resulting in noticeable artifacts. To this end, we propose SparseGS-W, a novel framework based on 3D Gaussian Splatting that enables the reconstruction of complex outdoor scenes and handles occlusions and appearance changes with as few as five training images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. Specifically, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively improve the quality of rendered novel views during the Gaussian optimization process. Furthermore, we propose an Occlusion Handling module, which flexibly removes occlusions utilizing the inherent high-quality inpainting capability of constrained diffusion priors. Both modules are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments on the PhotoTourism and Tanks and Temples datasets demonstrate that SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA, and MUSIQ.

Title: G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

Authors: Juntao Jian, Xiuping Liu, Zixuan Chen, Manyi Li, Jian Liu, Ruizhen Hu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.19457
Pdf URL: https://arxiv.org/pdf/2503.19457
Copy Paste: [[2503.19457]] G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation(https://arxiv.org/abs/2503.19457)
Keywords: generative
Abstract: Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. But it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution plays as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate the remarkable performance against the existing approaches. Project page: this https URL

Title: AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Authors: Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19462
Pdf URL: https://arxiv.org/pdf/2503.19462
Copy Paste: [[2503.19462]] AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset(https://arxiv.org/abs/2503.19462)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, to reduce the inference steps for accelerating video diffusion models with a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves 8.5x improvements in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5 seconds, 720x1280, 24fps.

Title: Noisier2Inverse: Self-Supervised Learning for Image Reconstruction with Correlated Noise

Authors: Nadja Gruber, Johannes Schwab, Markus Haltmeier, Ander Biguri, Clemens Dlaska, Gyeongha Hwang
Subjects: cs.CV, eess.IV, math.OC
Abstract URL: https://arxiv.org/abs/2503.19468
Pdf URL: https://arxiv.org/pdf/2503.19468
Copy Paste: [[2503.19468]] Noisier2Inverse: Self-Supervised Learning for Image Reconstruction with Correlated Noise(https://arxiv.org/abs/2503.19468)
Keywords: self-supervised
Abstract: We propose Noisier2Inverse, a correction-free self-supervised deep learning approach for general inverse problems. The proposed method learns a reconstruction function without the need for ground truth samples and is applicable in cases where measurement noise is statistically correlated. This includes computed tomography, where detector imperfections or photon scattering create correlated noise patterns, as well as microscopy and seismic imaging, where physical interactions during measurement introduce dependencies in the noise structure. Similar to Noisier2Noise, a key step in our approach is the generation of noisier data from which the reconstruction network learns. However, unlike Noisier2Noise, the proposed loss function operates in measurement space and is trained to recover an extrapolated image instead of the original noisy one. This eliminates the need for an extrapolation step during inference, which would otherwise suffer from ill-posedness. We numerically demonstrate that our method clearly outperforms previous self-supervised approaches that account for correlated noise.

Title: GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Authors: Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19480
Pdf URL: https://arxiv.org/pdf/2503.19480
Copy Paste: [[2503.19480]] GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers(https://arxiv.org/abs/2503.19480)
Keywords: generative
Abstract: The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.

Title: KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models

Authors: Zhiwei Wang, Zhongxin Liu, Ying Li, Hongyu Sun, Meng Xu, Yuqing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19482
Pdf URL: https://arxiv.org/pdf/2503.19482
Copy Paste: [[2503.19482]] KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models(https://arxiv.org/abs/2503.19482)
Keywords: generative
Abstract: The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.

Title: Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage

Authors: Zhengwentai Sun, Heyuan Li, Xihe Yang, Keru Zheng, Shuliang Ning, Yihao Zhi, Hongjie Liao, Chenghong Li, Shuguang Cui, Xiaoguang Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19486
Pdf URL: https://arxiv.org/pdf/2503.19486
Copy Paste: [[2503.19486]] Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage(https://arxiv.org/abs/2503.19486)
Keywords: generative
Abstract: Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of a virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: this https URL.

Title: SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling

Authors: Yaofei Wang, Gang Pei, Kejiang Chen, Jinyang Ding, Chao Pan, Weilong Pang, Donghui Hu, Weiming Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.19499
Pdf URL: https://arxiv.org/pdf/2503.19499
Copy Paste: [[2503.19499]] SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling(https://arxiv.org/abs/2503.19499)
Keywords: generative
Abstract: Steganography embeds confidential data within seemingly innocuous communications. Provable security in steganography, a long-sought goal, has become feasible with deep generative models. However, existing methods face a critical trade-off between security and efficiency. This paper introduces SparSamp, an efficient provably secure steganography method based on sparse sampling. SparSamp embeds messages by combining them with pseudo-random numbers to obtain message-derived random numbers for sampling. It enhances extraction accuracy and embedding capacity by increasing the sampling intervals and making the sampling process sparse. SparSamp preserves the original probability distribution of the generative model, thus ensuring security. It introduces only $O(1)$ additional complexity per sampling step, enabling the fastest embedding speed without compromising generation speed. SparSamp is designed to be plug-and-play; message embedding can be achieved by simply replacing the sampling component of an existing generative model with SparSamp. We implemented SparSamp in text, image, and audio generation models. It can achieve embedding speeds of up to 755 bits/second with GPT-2, 5046 bits/second with DDPM, and 9,223 bits/second with WaveRNN.

Title: VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models

Authors: Suhas G Hegde, Shilpy Kaur, Aruna Tiwari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19530
Pdf URL: https://arxiv.org/pdf/2503.19530
Copy Paste: [[2503.19530]] VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models(https://arxiv.org/abs/2503.19530)
Keywords: foundation model
Abstract: Popular PEFT methods achieve parameter efficiency by assuming that incremental weight updates are inherently low-rank, which often leads to a performance gap compared to full fine-tuning. While recent methods have attempted to address this limitation, they typically lack sufficient parameter and memory efficiency. We propose VectorFit, an effective and easily deployable approach that adaptively trains the singular vectors and biases of pre-trained weight matrices. We demonstrate that the utilization of structural and transformational characteristics of pre-trained weights enables high-rank updates comparable to those of full fine-tuning. As a result, VectorFit achieves superior performance with 9X fewer trainable parameters compared to state-of-the-art PEFT methods. Through extensive experiments over 17 datasets spanning diverse language and vision tasks such as natural language understanding and generation, question answering, image classification, and image generation, we show that VectorFit consistently outperforms baselines, even in extremely low-budget scenarios.

Title: Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion

Authors: Haim Sawdayee, Chuan Guo, Guy Tevet, Bing Zhou, Jian Wang, Amit H. Bermano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19557
Pdf URL: https://arxiv.org/pdf/2503.19557
Copy Paste: [[2503.19557]] Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion(https://arxiv.org/abs/2503.19557)
Keywords: diffusion, generative
Abstract: Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a "Chicken" style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution low quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
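The low-rank adaptation mentioned in this abstract can be illustrated with the generic LoRA update, W' = W + (alpha / r) * B A. The following is a minimal sketch of that general formulation only, not of LoRA-MDM itself; the helper names (`matmul`, `lora_weight`), the toy 3x3 weight, and the `alpha` value are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the frozen weight plus a rank-r update."""
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy frozen weight (hypothetical): a 3x3 identity matrix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.0, -1.0]]        # r x in, with r = 1
B_zero = [[0.0], [0.0], [0.0]]  # out x r, zero-initialized
B_trained = [[1.0], [0.0], [2.0]]  # stand-in for a trained factor

unchanged = lora_weight(W, A, B_zero, alpha=2.0, r=1)
adapted = lora_weight(W, A, B_trained, alpha=2.0, r=1)
```

With `B` initialized to zero, the adapted weight equals the frozen weight, so adaptation starts from the unmodified prior; only the small factors `A` and `B` would be trained.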

Title: Post-Hoc Calibrated Anomaly Detection

Authors: Sean Gloumeau
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.19577
Pdf URL: https://arxiv.org/pdf/2503.19577
Copy Paste: [[2503.19577]] Post-Hoc Calibrated Anomaly Detection(https://arxiv.org/abs/2503.19577)
Keywords: anomaly
Abstract: Deep unsupervised anomaly detection has seen improvements in a supervised binary classification paradigm in which auxiliary external data is included in the training set as anomalous data in a process referred to as outlier exposure, which opens the possibility of exploring the efficacy of post-hoc calibration for anomaly detection and localization. Post-hoc Platt scaling and Beta calibration are found to improve results with gradient-based input perturbation, as well as post-hoc training with a strictly proper loss of a base model initially trained on an unsupervised loss. Post-hoc calibration is also found at times to be more effective using random synthesized spectral data as labeled anomalous data in the calibration set, suggesting that outlier exposure is superior only for initial training.
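As a rough illustration of the post-hoc Platt scaling named in this abstract, the sketch below fits p = sigmoid(a * score + b) to raw anomaly scores on a small labeled calibration set. It uses plain gradient descent on the log loss and omits the target smoothing of the original Platt procedure; the function names, learning rate, and toy scores are assumptions, and nothing here reproduces the paper's setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical calibration set: raw anomaly scores with 0/1 labels
# (1 = anomalous). The classes overlap, so the fit stays bounded.
scores = [0.2, 0.5, 0.9, 1.1, 2.8, 3.1, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
p_low = sigmoid(a * 0.2 + b)   # calibrated probability for a low score
p_high = sigmoid(a * 4.0 + b)  # calibrated probability for a high score
```

The base model's scores are left untouched; only the two scalars `a` and `b` are learned, which is what makes the calibration post-hoc.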

Title: Video Anomaly Detection with Contours - A Study

Authors: Mia Siemon, Ivan Nikolov, Thomas B. Moeslund, Kamal Nasrollahi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19588
Pdf URL: https://arxiv.org/pdf/2503.19588
Copy Paste: [[2503.19588]] Video Anomaly Detection with Contours - A Study(https://arxiv.org/abs/2503.19588)
Keywords: anomaly
Abstract: In Pose-based Video Anomaly Detection, prior art is rooted in the assumption that abnormal events can be mostly regarded as a result of uncommon human behavior. Rather than utilizing skeleton representations of humans, however, we investigate the potential of learning recurrent motion patterns of normal human behavior using 2D contours. Keeping all advantages of pose-based methods, such as increased object anonymization, the shift from human skeletons to contours is hypothesized to leave the opportunity to cover more object categories open for future research. We propose formulating the problem as a regression and a classification task, and additionally explore two distinct data representation techniques for contours. To further reduce the computational complexity of Pose-based Video Anomaly Detection solutions, all methods in this study are based on shallow Neural Networks from the field of Deep Learning, and evaluated on the three most prominent benchmark datasets within Video Anomaly Detection and their human-related counterparts, totaling six datasets. Our results indicate that this novel perspective on Pose-based Video Anomaly Detection marks a promising direction for future research.

Title: Optimization through In-Context Learning and Iterative LLM Prompting for Nuclear Engineering Design Problems

Authors: M. Rizki Oktavian, Anirudh Tunga, Amandeep Bakshi, Michael J. Mueterthies, J. Thomas Gruenwald, Jonathan Nistor
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2503.19620
Pdf URL: https://arxiv.org/pdf/2503.19620
Copy Paste: [[2503.19620]] Optimization through In-Context Learning and Iterative LLM Prompting for Nuclear Engineering Design Problems(https://arxiv.org/abs/2503.19620)
Keywords: in-context
Abstract: The optimization of nuclear engineering designs, such as nuclear fuel assembly configurations, involves managing competing objectives like reactivity control and power distribution. This study explores the use of Optimization by Prompting, an iterative approach utilizing large language models (LLMs), to address these challenges. The method is straightforward to implement, requiring no hyperparameter tuning or complex mathematical formulations. Optimization problems can be described in plain English, with only an evaluator and a parsing script needed for execution. The in-context learning capabilities of LLMs enable them to understand problem nuances; therefore, they have the potential to surpass traditional metaheuristic optimization methods. This study demonstrates the application of LLMs as optimizers to Boiling Water Reactor (BWR) fuel lattice design, showing the capability of commercial LLMs to achieve superior optimization results compared to traditional methods.

Title: Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation

Authors: Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Filip Janicki, Cristiano Malossi, Konrad Schindler, Roy Assaf
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19647
Pdf URL: https://arxiv.org/pdf/2503.19647
Copy Paste: [[2503.19647]] Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation(https://arxiv.org/abs/2503.19647)
Keywords: foundation model
Abstract: Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.
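For readers unfamiliar with the Intersection-over-Union metric quoted in this abstract, here is a minimal sketch of how it is computed for one pair of binary masks. The nested-list mask format, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def iou(pred, target):
    """Intersection-over-Union of two binary masks given as nested lists."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0  # pixel set in both masks
            union += 1 if (p or t) else 0   # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 2x3 masks (hypothetical): 2 pixels overlap, 4 pixels are in the union.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

score = iou(pred_mask, true_mask)  # 2 / 4 = 0.5
```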

Title: OpenSDI: Spotting Diffusion-Generated Images in the Open World

Authors: Yabin Wang, Zhiwu Huang, Xiaopeng Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19653
Pdf URL: https://arxiv.org/pdf/2503.19653
Copy Paste: [[2503.19653]] OpenSDI: Spotting Diffusion-Generated Images in the Open World(https://arxiv.org/abs/2503.19653)
Keywords: diffusion, foundation model
Abstract: This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at this https URL.

Title: CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation

Authors: Rupak Bose, Chinedu Innocent Nwoye, Aditya Bhat, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19661
Pdf URL: https://arxiv.org/pdf/2503.19661
Copy Paste: [[2503.19661]] CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation(https://arxiv.org/abs/2503.19661)
Keywords: diffusion, generative
Abstract: The acquisition of annotated datasets with paired images and segmentation masks is a critical challenge in domains such as medical imaging, remote sensing, and computer vision. Manual annotation demands significant resources, faces ethical constraints, and depends heavily on domain expertise. Existing generative models often target single-modality outputs, either images or segmentation masks, failing to address the need for high-quality, simultaneous image-mask generation. Additionally, these models frequently lack adaptable conditioning mechanisms, restricting control over the generated outputs and limiting their applicability for dataset augmentation and rare scenario simulation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. Conditioning is intuitively achieved through (1) text prompts grounded in class semantics, (2) spatial embedding of context prompts to provide spatial coherence, and (3) spectral embedding of timestep information to model noise levels during diffusion. To enhance controllability and training efficiency, the framework incorporates contrastive triplet loss between text and class embeddings, alongside diffusion and adversarial losses. Initial low-resolution outputs (128 x 128) are super-resolved to 512 x 512, producing high-fidelity images and masks with strict adherence to conditions. We evaluate CoSimGen on metrics such as FID, KID, LPIPS, Class FID, and positive predictive value for image fidelity and semantic alignment of generated samples over 4 diverse datasets. CoSimGen achieves state-of-the-art performance across all datasets, achieving the lowest KID of 0.11 and LPIPS of 0.53 across datasets.

Title: PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models

Authors: Junhyuk So, Jiwoong Shin, Chaeyeon Jang, Eunhyeok Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19731
Pdf URL: https://arxiv.org/pdf/2503.19731
Copy Paste: [[2503.19731]] PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models(https://arxiv.org/abs/2503.19731)
Keywords: diffusion
Abstract: Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard iteration was introduced, effectively reducing sequential steps while ensuring exact convergence to the original output. Nonetheless, Picard iteration does not guarantee faster convergence, which can still result in slow generation in practice. In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. Inspired by the consistency model, PCM is directly trained to predict the fixed-point solution, or the final output, at any stage of the convergence trajectory. Additionally, we introduce a new concept called model switching, which addresses PCM's limitations and ensures exact convergence. Extensive experiments demonstrate that PCM achieves up to a 2.71x speedup over sequential sampling and a 1.77x speedup over Picard iteration across various tasks, including image generation and robotic control.
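The baseline this abstract builds on, parallel sampling via Picard iteration, can be sketched on a toy scalar ODE discretized with Euler steps. This is an illustrative sketch of plain Picard iteration only, not of PCM or its model switching; the drift function, step size, and function names are assumptions.

```python
def euler_sequential(f, x0, h, n_steps):
    """Plain sequential Euler: each step waits for the previous one."""
    xs = [x0]
    for t in range(n_steps):
        xs.append(xs[-1] + h * f(xs[-1], t))
    return xs


def picard_parallel(f, x0, h, n_steps, tol=1e-12, max_iters=None):
    """Picard iteration over the whole trajectory.

    Every drift evaluation inside one sweep depends only on the previous
    sweep's trajectory, so those evaluations could run in parallel.
    """
    max_iters = n_steps if max_iters is None else max_iters
    xs = [x0] * (n_steps + 1)  # initial guess: constant trajectory
    for k in range(1, max_iters + 1):
        drifts = [f(xs[t], t) for t in range(n_steps)]  # parallelizable
        new_xs = [x0]
        running = 0.0
        for t in range(n_steps):
            running += h * drifts[t]  # prefix sum of the drift terms
            new_xs.append(x0 + running)
        change = max(abs(u - v) for u, v in zip(new_xs, xs))
        xs = new_xs
        if change < tol:
            return xs, k
    return xs, max_iters


# Toy drift (hypothetical): dx/dt = -x, 20 Euler steps of size 0.1.
def drift(x, t):
    return -x


xs_seq = euler_sequential(drift, 1.0, 0.1, 20)
xs_par, sweeps = picard_parallel(drift, 1.0, 0.1, 20)
```

After sweep k, the first k points already match the sequential solution, so n_steps sweeps always suffice. The sweep count is not guaranteed to be much smaller than n_steps, which is the slow-convergence issue the abstract describes.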

Title: FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion

Authors: Pihai Sun (1), Junjun Jiang (1), Yuanqi Yao (1), Youyu Chen (1), Wenbo Zhao (1), Kui Jiang (1), Xianming Liu (1) ((1) Faculty of Computing, Harbin Institute of Technology)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19739
Pdf URL: https://arxiv.org/pdf/2503.19739
Copy Paste: [[2503.19739]] FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion(https://arxiv.org/abs/2503.19739)
Keywords: self-supervised, foundation model
Abstract: Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground this http URL this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in this http URL on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: this https URL

Title: Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings

Authors: Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19740
Pdf URL: https://arxiv.org/pdf/2503.19740
Copy Paste: [[2503.19740]] Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings(https://arxiv.org/abs/2503.19740)
Keywords: self-supervised, foundation model
Abstract: Advancements in computer-assisted surgical procedures heavily rely on accurate visual data interpretation from camera systems used during surgeries. Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos with less than 100K images. To address these constraints, a new dataset called Surg-3M has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos and more than 3 million high-quality images from multiple procedure types, Surg-3M offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel tasks. To demonstrate the effectiveness of this dataset, we present SurgFM, a self-supervised foundation model pretrained on Surg-3M that achieves impressive results in downstream tasks such as surgical phase recognition, action recognition, and tool presence detection. Combining key components from ConvNeXt, DINO, and an innovative augmented distillation method, SurgFM exhibits exceptional performance compared to specialist architectures across various benchmarks. Our experimental results show that SurgFM outperforms state-of-the-art models in multiple downstream tasks, including significant gains in surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), action recognition (+3.1pp of mAP in CholecT50) and tool presence detection (+4.6pp of mAP in Cholec80). Moreover, even when using only half of the data, SurgFM outperforms state-of-the-art models in AutoLaparo and achieves state-of-the-art performance in Cholec80. Both Surg-3M and SurgFM have significant potential to accelerate progress towards developing autonomous robotic surgery systems.

Title: ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19755
Pdf URL: https://arxiv.org/pdf/2503.19755
Copy Paste: [[2503.19755]] ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation(https://arxiv.org/abs/2503.19755)
Keywords: generative
Abstract: End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem remains open: few VLMs for E2E methods perform well in closed-loop evaluation because of the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

Title: Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

Authors: Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19783
Pdf URL: https://arxiv.org/pdf/2503.19783
Copy Paste: [[2503.19783]] Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models(https://arxiv.org/abs/2503.19783)
Keywords: diffusion, foundation model, generative
Abstract: Existing unlearning algorithms in text-to-image generative models often fail to preserve the knowledge of semantically related concepts when removing specific target concepts: a challenge known as adjacency. To address this, we propose FADE (Fine grained Attenuation for Diffusion Erasure), introducing adjacency-aware unlearning in diffusion models. FADE comprises two components: (1) the Concept Neighborhood, which identifies an adjacency set of related concepts, and (2) Mesh Modules, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving at least a 12% improvement in retention performance over state-of-the-art methods.

Title: SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation

Authors: Jingdan Kang, Haoxin Yang, Yan Cai, Huaidong Zhang, Xuemiao Xu, Yong Du, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19791
Pdf URL: https://arxiv.org/pdf/2503.19791
Copy Paste: [[2503.19791]] SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation(https://arxiv.org/abs/2503.19791)
Keywords: diffusion
Abstract: Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method's robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. Code is available at this https URL.

Title: In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush

Authors: Vitaly Gnatyuk, Valeriia Koriukina, Ilya Levoshevich, Pavel Nurminskiy, Guenter Wallner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19793
Pdf URL: https://arxiv.org/pdf/2503.19793
Copy Paste: [[2503.19793]] In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush(https://arxiv.org/abs/2503.19793)
Keywords: diffusion, generative
Abstract: With video games steadily increasing in complexity, automated generation of game content has found widespread interest. However, the task of 3D gaming map art creation remains underexplored to date due to its unique complexity and domain-specific challenges. While recent works have addressed related topics such as retro-style level generation and procedural terrain creation, these works primarily focus on simpler data distributions. To the best of our knowledge, we are the first to demonstrate the application of modern AI techniques for high-resolution texture manipulation in complex, highly detailed AAA 3D game environments. We introduce a novel Smart Brush for map editing, designed to assist artists in seamlessly modifying selected areas of a game map with minimal effort. By leveraging generative adversarial networks and diffusion models, we propose two variants of the brush that enable efficient and context-aware generation. Our hybrid workflow aims to enhance both artistic flexibility and production efficiency, enabling the refinement of environments without manually reworking every detail, thus helping to bridge the gap between automation and creative control in game development. A comparative evaluation of our two methods with adapted versions of several state-of-the-art models shows that our GAN-based brush produces the sharpest and most detailed outputs while preserving image context, whereas the evaluated state-of-the-art models tend towards blurrier results and exhibit difficulties in maintaining contextual consistency.

Title: Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models

Authors: Ruixi You, Hecheng Jia, Feng Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19798
Pdf URL: https://arxiv.org/pdf/2503.19798
Copy Paste: [[2503.19798]] Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models(https://arxiv.org/abs/2503.19798)
Keywords: diffusion
Abstract: Synthetic Aperture Radar (SAR) imagery provides all-weather, all-day, and high-resolution imaging capabilities but its unique imaging mechanism makes interpretation heavily reliant on expert knowledge, limiting interpretability, especially in complex target tasks. Translating SAR images into optical images is a promising solution to enhance interpretation and support downstream tasks. Most existing research focuses on scene-level translation, with limited work on object-level translation due to the scarcity of paired data and the challenge of accurately preserving contour and texture details. To address these issues, this study proposes a keypoint-guided diffusion model (KeypointDiff) for SAR-to-optical image translation of unpaired aircraft targets. This framework introduces supervision on target class and azimuth angle via keypoints, along with a training strategy for unpaired data. Based on the classifier-free guidance diffusion architecture, a class-angle guidance module (CAGM) is designed to integrate class and angle information into the diffusion generation process. Furthermore, adversarial loss and consistency loss are employed to improve image fidelity and detail quality, tailored for aircraft targets. During sampling, aided by a pre-trained keypoint detector, the model eliminates the requirement for manually labeled class and azimuth information, enabling automated SAR-to-optical translation. Experimental results demonstrate that the proposed method outperforms existing approaches across multiple metrics, providing an efficient and effective solution for object-level SAR-to-optical translation and downstream tasks. Moreover, the method exhibits strong zero-shot generalization to untrained aircraft types with the assistance of the keypoint detector.
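
The abstract above builds on classifier-free guidance. As a minimal sketch of the guidance step itself (not the paper's CAGM module), the following shows one common parameterization, in which the guided noise estimate is the unconditional estimate pushed toward the conditional one by a scale w. The function name, the list-based inputs, and the choice of this parameterization are assumptions made for illustration; some implementations use (1 + w) instead.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise estimates.

    Assumed parameterization: eps = eps_uncond + w * (eps_cond - eps_uncond),
    so w = 0 gives the unconditional estimate and w = 1 the conditional one.
    Inputs are plain lists of floats standing in for model outputs.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise estimates must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise estimates for a two-element "image".
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], 2.0)
```

With w greater than 1 the result extrapolates beyond the conditional estimate, which is how guidance strengthens the influence of the condition (here, class and angle information would play that role).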

Title: SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI

Authors: Zhiyang Liu, Dong Yang, Minghao Zhang, Hanyu Sun, Hong Wu, Huiying Wang, Wen Shen, Chao Chai, Shuang Xia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19801
Pdf URL: https://arxiv.org/pdf/2503.19801
Copy Paste: [[2503.19801]] SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI(https://arxiv.org/abs/2503.19801)
Keywords: foundation model
Abstract: Although deep learning (DL) methods have shown tremendous potential in many medical image analysis tasks, the practical applications of medical DL models are limited by the lack of sufficient data samples with manual annotations. Noting that clinical radiology examinations are associated with radiology reports that describe the images, we propose to develop a foundation model for multi-modal head MRI by using contrastive learning on the images and the corresponding radiology findings. In particular, a contrastive learning framework is proposed, where a mixed syntax and semantic similarity matching metric is integrated to reduce the need for extremely large datasets in conventional contrastive learning frameworks. Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features. Experiments revealed that our proposed SeLIP performs well in many downstream tasks, including image-text retrieval, classification, and image segmentation, which highlights the importance of considering the similarities among texts describing different images in developing medical image foundation models.

Title: Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning

Authors: Pratibha Kumari, Afshin Bozorgpour, Daniel Reisenbüchler, Edgar Jost, Martina Crysandt, Christian Matek, Dorit Merhof
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19819
Pdf URL: https://arxiv.org/pdf/2503.19819
Copy Paste: [[2503.19819]] Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning(https://arxiv.org/abs/2503.19819)
Keywords: foundation model, generative
Abstract: White blood cell (WBC) classification plays a vital role in hematology for diagnosing various medical conditions. However, it faces significant challenges due to domain shifts caused by variations in sample sources (e.g., blood or bone marrow) and differing imaging conditions across hospitals. Traditional deep learning models often suffer from catastrophic forgetting in such dynamic environments, while foundation models, though generally robust, experience performance degradation when the distribution of inference data differs from that of the training data. To address these challenges, we propose a generative replay-based Continual Learning (CL) strategy designed to prevent forgetting in foundation models for WBC classification. Our method employs lightweight generators to mimic past data with a synthetic latent representation to enable privacy-preserving replay. To showcase the effectiveness, we carry out extensive experiments with a total of four datasets with different task ordering and four backbone models including ResNet50, RetCCL, CTransPath, and UNI. Experimental results demonstrate that conventional fine-tuning methods degrade performance on previously learned tasks and struggle with domain shifts. In contrast, our continual learning strategy effectively mitigates catastrophic forgetting, preserving model performance across varying domains. This work presents a practical solution for maintaining reliable WBC classification in real-world clinical settings, where data distributions frequently evolve.

Title: FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

Authors: Jun Zhou, Jiahao Li, Zunnan Xu, Hanhui Li, Yiji Cheng, Fa-Ting Hong, Qin Lin, Qinglin Lu, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19839
Pdf URL: https://arxiv.org/pdf/2503.19839
Copy Paste: [[2503.19839]] FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model(https://arxiv.org/abs/2503.19839)
Keywords: diffusion
Abstract: Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses the state-of-the-art instruction-based image editing methods. Our project is available at this https URL.

Title: An Overview of Low-Rank Structures in the Training and Adaptation of Large Models

Authors: Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, Can Yaras
Subjects: cs.LG, eess.SP, math.OC, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2503.19859
Pdf URL: https://arxiv.org/pdf/2503.19859
Copy Paste: [[2503.19859]] An Overview of Low-Rank Structures in the Training and Adaptation of Large Models(https://arxiv.org/abs/2503.19859)
Keywords: self-supervised
Abstract: The rise of deep learning has revolutionized data processing and prediction in signal processing and machine learning, yet the substantial computational demands of training and deploying modern large-scale deep models present significant challenges, including high computational costs and energy consumption. Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures in weight matrices and learned representations during training. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. Practical techniques inspired by this phenomenon, such as low-rank adaptation (LoRA) and training, enable significant reductions in computational cost while preserving model performance. In this paper, we present a comprehensive review of recent advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations. Mathematically, we present two complementary perspectives on understanding the low-rankness in deep networks: (i) the emergence of low-rank structures throughout the whole optimization dynamics of gradient descent and (ii) the implicit regularization effects that induce such low-rank structures at convergence. From a practical standpoint, studying the low-rank learning dynamics of gradient descent offers a mathematical foundation for understanding the effectiveness of LoRA in fine-tuning large-scale models and inspires parameter-efficient low-rank training strategies. Furthermore, the implicit low-rank regularization effect helps explain the success of various masked training approaches in deep neural networks, ranging from dropout to masked self-supervised learning.
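
To make the LoRA idea mentioned above concrete, here is a minimal pure-Python sketch of the standard low-rank update, W' = W + (alpha / r) * B A, where B is d_out x r and A is r x d_in. The helper names and the tiny matrices are illustrative assumptions, not taken from the survey; real implementations operate on tensors inside a deep learning framework.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    r = len(a)                      # rank = number of rows of A
    scale = alpha / r
    delta = matmul(b, a)            # low-rank update, shape d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable entries in B and A together."""
    return r * (d_out + d_in)


# Rank-1 update of a 2 x 2 identity weight (hypothetical values).
w_new = lora_effective_weight([[1, 0], [0, 1]], a=[[3, 4]], b=[[1], [2]], alpha=1)
```

For a 4096 x 4096 layer at rank 8, `lora_trainable_params(4096, 4096, 8)` counts 65,536 trainable entries, against roughly 16.8 million in the full weight matrix, which is the source of the cost reduction the abstract refers to.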

Title: Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Authors: Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19881
Pdf URL: https://arxiv.org/pdf/2503.19881
Copy Paste: [[2503.19881]] Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation(https://arxiv.org/abs/2503.19881)
Keywords: diffusion
Abstract: Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is this https URL.
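
The symmetric binary attention mask described above can be sketched as follows. This is only one reading of the abstract, under stated assumptions: tokens are laid out as all text tokens (grouped by scene) followed by all visual tokens (grouped by scene), visual tokens may attend to every visual token, and a text token interacts only with text and visual tokens of its own scene. The function name and token layout are hypothetical and are not claimed to match the authors' implementation.

```python
def build_segment_mask(text_lens, vis_lens):
    """Build a symmetric 0/1 attention mask over [text tokens..., visual tokens...].

    text_lens[i] and vis_lens[i] give the token counts for scene i.
    Entry 1 means attention is allowed, 0 means it is blocked.
    """
    if len(text_lens) != len(vis_lens):
        raise ValueError("need one text length and one visual length per scene")

    tokens = []  # (kind, scene index) for every token position
    for i, n in enumerate(text_lens):
        tokens += [("text", i)] * n
    for i, n in enumerate(vis_lens):
        tokens += [("vis", i)] * n

    def allowed(p, q):
        (kind_p, seg_p), (kind_q, seg_q) = p, q
        if kind_p == "vis" and kind_q == "vis":
            return 1                      # keep coherence across all visual tokens
        return 1 if seg_p == seg_q else 0  # text is restricted to its own scene

    return [[allowed(p, q) for q in tokens] for p in tokens]


# Two scenes, one text token and two visual tokens each.
mask = build_segment_mask([1, 1], [2, 2])
```

In this example the first row (the text token of scene 0) allows attention to itself and to the two visual tokens of scene 0 only, while every visual row allows attention to all four visual tokens.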

Title: Scaling Down Text Encoders of Text-to-Image Diffusion Models

Authors: Lifu Wang, Daqing Liu, Xinchen Liu, Xiaodong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19897
Pdf URL: https://arxiv.org/pdf/2503.19897
Copy Paste: [[2503.19897]] Scaling Down Text Encoders of Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.19897)
Keywords: diffusion
Abstract: Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.

Title: CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Authors: Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hanchao Yu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.19900
Pdf URL: https://arxiv.org/pdf/2503.19900
Copy Paste: [[2503.19900]] CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning(https://arxiv.org/abs/2503.19900)
Keywords: generative
Abstract: The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.

Title: ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models

Authors: Fernando Julio Cendra, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19902
Pdf URL: https://arxiv.org/pdf/2503.19902
Copy Paste: [[2503.19902]] ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models(https://arxiv.org/abs/2503.19902)
Keywords: diffusion, generative
Abstract: The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilizes a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: this https URL

Title: AvatarArtist: Open-Domain 4D Avatarization

Authors: Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19906
Pdf URL: https://arxiv.org/pdf/2503.19906
Copy Paste: [[2503.19906]] AvatarArtist: Open-Domain 4D Avatarization(https://arxiv.org/abs/2503.19906)
Keywords: diffusion, generative
Abstract: This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.

Title: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19907
Pdf URL: https://arxiv.org/pdf/2503.19907
Copy Paste: [[2503.19907]] FullDiT: Multi-Task Video Generative Foundation Model with Full Attention(https://arxiv.org/abs/2503.19907)
Keywords: foundation model, generative
Abstract: Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.

Title: SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining

Authors: Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.19912
Pdf URL: https://arxiv.org/pdf/2503.19912
Copy Paste: [[2503.19912]] SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining(https://arxiv.org/abs/2503.19912)
Keywords: foundation model
Abstract: LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at this https URL

Title: PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model

Authors: Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19913
Pdf URL: https://arxiv.org/pdf/2503.19913
Copy Paste: [[2503.19913]] PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model(https://arxiv.org/abs/2503.19913)
Keywords: diffusion
Abstract: As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model's understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.

Title: Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Authors: Sangwon Beak, Hyeonwoo Kim, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19914
Pdf URL: https://arxiv.org/pdf/2503.19914
Copy Paste: [[2503.19914]] Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models(https://arxiv.org/abs/2503.19914)
Keywords: diffusion
Abstract: We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.