2025-03-14

Title: Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference

Authors: Songlin Yang, Tao Yang, Bo Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09646
Pdf URL: https://arxiv.org/pdf/2503.09646
Copy Paste: [[2503.09646]] Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference(https://arxiv.org/abs/2503.09646)
Keywords: diffusion
Abstract: The deployment of sensors for air quality monitoring is constrained by high costs, leading to inadequate network coverage and data deficits in some areas. Utilizing existing observations, spatio-temporal kriging is a method for estimating air quality at unobserved locations during a specific period. Inductive spatio-temporal kriging with increment training strategy has demonstrated its effectiveness using virtual nodes to simulate unobserved nodes. However, a disparity between virtual and real nodes persists, complicating the application of learning patterns derived from virtual nodes to actual unobserved ones. To address these limitations, this paper presents a Physics-Guided Increment Training Strategy (PGITS). Specifically, we design a dynamic graph generation module to incorporate the advection and diffusion processes of airborne particles as physical knowledge into the graph structure, dynamically adjusting the adjacency matrix to reflect physical interactions between nodes. By using physics principles as a bridge between virtual and real nodes, this strategy ensures the features of virtual nodes and their pseudo labels are closer to actual nodes. Consequently, the learned patterns of virtual nodes can be applied to actual unobserved nodes for effective kriging.

Title: LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics

Authors: Jialiang Tang, Shuo Chen, Chen Gong, Jing Zhang, Dacheng Tao
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.09656
Pdf URL: https://arxiv.org/pdf/2503.09656
Copy Paste: [[2503.09656]] LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics(https://arxiv.org/abs/2503.09656)
Keywords: in-context
Abstract: Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring. Recent studies have revealed that Large Language Models (LLMs), with their powerful in-contextual modeling capabilities, hold significant potential for TSF. However, existing LLM-based methods usually perform suboptimally because they neglect the inherent characteristics of time series data. Unlike the textual data used in LLM pre-training, the time series data is semantically sparse and comprises distinctive temporal patterns. To address this problem, we propose LLM-PS to empower the LLM for TSF by learning the fundamental Patterns and meaningful Semantics from time series data. Our LLM-PS incorporates a new multi-scale convolutional neural network adept at capturing both short-term fluctuations and long-term trends within the time series. Meanwhile, we introduce a time-to-text module for extracting valuable semantics across continuous time intervals rather than isolated time points. By integrating these patterns and semantics, LLM-PS effectively models temporal dependencies, enabling a deep comprehension of time series and delivering accurate forecasts. Intensive experimental results demonstrate that LLM-PS achieves state-of-the-art performance in both short- and long-term forecasting tasks, as well as in few- and zero-shot settings.

Title: CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Authors: Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09662
Pdf URL: https://arxiv.org/pdf/2503.09662
Copy Paste: [[2503.09662]] CoRe^2: Collect, Reflect and Refine to Generate Better and Faster(https://arxiv.org/abs/2503.09662)
Keywords: diffusion, generative
Abstract: Making text-to-image (T2I) generative models sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have not been able to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn contents while reducing the number of function evaluations during inference by half. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving 5.64s time saving using this http URL. Code is released at this https URL.
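
Note: for readers unfamiliar with classifier-free guidance (CFG), the sketch below shows the standard CFG combination rule, not the CoRe^2 method itself. Each guided step typically needs two network evaluations, one conditional and one unconditional, which is the kind of cost the abstract refers to when counting function evaluations. The function name `cfg_combine` and the toy list inputs are assumptions made for illustration only.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    `cond` and `uncond` stand in for the model outputs with and without the
    text prompt (hypothetical toy vectors here, real models return tensors).
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# scale = 1 recovers the conditional output; larger scales push further
# away from the unconditional output.
guided = cfg_combine([1.0, 2.0], [0.5, 1.0], scale=3.0)
print(guided)
```

With `scale=0` the rule returns the unconditional output unchanged, and with `scale=1` it returns the conditional output.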

Title: Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models

Authors: Sangwon Jang, June Suk Choi, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09669
Pdf URL: https://arxiv.org/pdf/2503.09669
Copy Paste: [[2503.09669]] Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.09669)
Keywords: diffusion
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality contents from text prompts. However, their reliance on publicly available data and the growing trend of data sharing for fine-tuning make these models particularly vulnerable to data poisoning attacks. In this work, we introduce the Silent Branding Attack, a novel data poisoning method that manipulates text-to-image diffusion models to generate images containing specific brand logos or symbols without any text triggers. We find that when certain visual patterns appear repeatedly in the training data, the model learns to reproduce them naturally in its outputs, even without prompt mentions. Leveraging this, we develop an automated data poisoning algorithm that unobtrusively injects logos into original images, ensuring they blend naturally and remain undetected. Models trained on this poisoned dataset generate images containing logos without degrading image quality or text alignment. We experimentally validate our silent branding attack across two realistic settings on large-scale high-quality image datasets and style personalization datasets, achieving high success rates even without a specific text trigger. Human evaluation and quantitative metrics including logo detection show that our method can stealthily embed logos.

Title: Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

Authors: Shangwen Zhu, Han Zhang, Zhantao Yang, Qianyu Peng, Zhao Pu, Huangji Wang, Fan Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09675
Pdf URL: https://arxiv.org/pdf/2503.09675
Copy Paste: [[2503.09675]] Accelerating Diffusion Sampling via Exploiting Local Transition Coherence(https://arxiv.org/abs/2503.09675)
Keywords: diffusion
Abstract: Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship of the outputs from the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Because it makes no specific assumptions regarding the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of 1.67-fold in Stable Diffusion v2 and a speedup of 1.55-fold in video generation models. When combined with distillation models, LTC-Accel achieves a remarkable 10-fold speedup in video generation, allowing real-time generation of more than 16FPS.

Title: DRESS: Disentangled Representation-based Self-Supervised Meta-Learning for Diverse Tasks

Authors: Wei Cui, Tongzi Wu, Jesse C. Cresswell, Yi Sui, Keyvan Golestan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.09679
Pdf URL: https://arxiv.org/pdf/2503.09679
Copy Paste: [[2503.09679]] DRESS: Disentangled Representation-based Self-Supervised Meta-Learning for Diverse Tasks(https://arxiv.org/abs/2503.09679)
Keywords: self-supervised
Abstract: Meta-learning represents a strong class of approaches for solving few-shot learning tasks. Nonetheless, recent research suggests that simply pre-training a generic encoder can potentially surpass meta-learning algorithms. In this paper, we first discuss the reasons why meta-learning fails to stand out in these few-shot learning experiments, and hypothesize that it is due to the few-shot learning tasks lacking diversity. We propose DRESS, a task-agnostic Disentangled REpresentation-based Self-Supervised meta-learning approach that enables fast model adaptation on highly diversified few-shot learning tasks. Specifically, DRESS utilizes disentangled representation learning to create self-supervised tasks that can fuel the meta-training process. Furthermore, we also propose a class-partition based metric for quantifying the task diversity directly on the input space. We validate the effectiveness of DRESS through experiments on datasets with multiple factors of variation and varying complexity. The results suggest that DRESS is able to outperform competing methods on the majority of the datasets and task setups. Through this paper, we advocate for a re-examination of proper setups for task adaptation studies, and aim to reignite interest in the potential of meta-learning for solving few-shot learning tasks via disentangled representations.

Title: Revisiting semi-supervised learning in the era of foundation models

Authors: Ping Zhang, Zheda Mai, Quang-Huy Nguyen, Wei-Lun Chao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.09707
Pdf URL: https://arxiv.org/pdf/2503.09707
Copy Paste: [[2503.09707]] Revisiting semi-supervised learning in the era of foundation models(https://arxiv.org/abs/2503.09707)
Keywords: foundation model
Abstract: Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate representative SSL methods. We make a surprising observation: parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, even without leveraging unlabeled data. This motivates us to revisit self-training, a conceptually simple SSL baseline, where we use the supervised PEFT model to pseudo-label unlabeled data for further training. To overcome the notorious issue of noisy pseudo-labels, we propose ensembling multiple PEFT approaches and VFM backbones to produce more robust pseudo-labels. Empirical results validate the effectiveness of this simple yet powerful approach, providing actionable insights into SSL with VFMs and paving the way for more scalable and practical semi-supervised learning in the era of foundation models.
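
Note: the pseudo-label ensembling idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the ensemble is combined, so averaging class probabilities and applying a confidence threshold are assumptions here, as are the function name `ensemble_pseudo_labels` and the toy probabilities.

```python
from typing import List, Optional, Sequence


def ensemble_pseudo_labels(
    prob_sets: Sequence[Sequence[Sequence[float]]],
    threshold: float = 0.0,
) -> List[Optional[int]]:
    """Average per-class probabilities over models, then label each sample.

    prob_sets[m][i][k] is model m's probability that sample i has class k.
    A sample gets None when the averaged top probability is below `threshold`
    (an assumed filtering rule, not taken from the paper).
    """
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    labels: List[Optional[int]] = []
    for i in range(n_samples):
        n_classes = len(prob_sets[0][i])
        mean = [sum(m[i][k] for m in prob_sets) / n_models for k in range(n_classes)]
        best = max(range(n_classes), key=mean.__getitem__)
        labels.append(best if mean[best] >= threshold else None)
    return labels


# Two hypothetical models scoring two unlabeled samples over two classes.
model_a = [[0.9, 0.1], [0.4, 0.6]]
model_b = [[0.7, 0.3], [0.5, 0.5]]
print(ensemble_pseudo_labels([model_a, model_b], threshold=0.6))
```

In this toy run the models agree on the first sample, so it keeps its label, while the second sample is dropped because the averaged confidence falls below the threshold.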

Title: Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain

Authors: Yuanmin Huang, Mi Zhang, Zhaoxiang Wang, Wenxuan Li, Min Yang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09712
Pdf URL: https://arxiv.org/pdf/2503.09712
Copy Paste: [[2503.09712]] Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain(https://arxiv.org/abs/2503.09712)
Keywords: generative
Abstract: Time series classification (TSC) is a cornerstone of modern web applications, powering tasks such as financial data analysis, network traffic monitoring, and user behavior analysis. In recent years, deep neural networks (DNNs) have greatly enhanced the performance of TSC models in these critical domains. However, DNNs are vulnerable to backdoor attacks, where attackers can covertly implant triggers into models to induce malicious outcomes. Existing backdoor attacks targeting DNN-based TSC models remain elementary. In particular, early methods borrow trigger designs from computer vision, which are ineffective for time series data. More recent approaches utilize generative models for trigger generation, but at the cost of significant computational complexity. In this work, we analyze the limitations of existing attacks and introduce an enhanced method, FreqBack. Drawing inspiration from the fact that DNN models inherently capture frequency domain features in time series data, we identify that improper perturbations in the frequency domain are the root cause of ineffective attacks. To address this, we propose to generate triggers both effectively and efficiently, guided by frequency analysis. FreqBack exhibits substantial performance across five models and eight datasets, achieving an impressive attack success rate of over 90%, while maintaining less than a 3% drop in model accuracy on clean data.

Title: The Pitfalls of Imitation Learning when Actions are Continuous

Authors: Max Simchowitz, Daniel Pfrommer, Ali Jadbabaie
Subjects: cs.LG, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09722
Pdf URL: https://arxiv.org/pdf/2503.09722
Copy Paste: [[2503.09722]] The Pitfalls of Imitation Learning when Actions are Continuous(https://arxiv.org/abs/2503.09722)
Keywords: diffusion
Abstract: We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics are stable (i.e. contracting exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly "improper" imitator policies--those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity--or unless the expert trajectory distribution is sufficiently "spread." We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion Policies). We also establish a host of complementary negative and positive results for imitation in control systems.

Title: I2V3D: Controllable image-to-video generation with 3D guidance

Authors: Zhiyuan Zhang, Dongdong Chen, Jing Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09733
Pdf URL: https://arxiv.org/pdf/2503.09733
Copy Paste: [[2503.09733]] I2V3D: Controllable image-to-video generation with 3D guidance(https://arxiv.org/abs/2503.09733)
Keywords: diffusion, generative
Abstract: We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.

Title: Solving Bayesian inverse problems with diffusion priors and off-policy RL

Authors: Luca Scimeca, Siddarth Venkatraman, Moksh Jain, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector-Brooks, Yashar Hezaveh, Laurence Perreault-Levasseur, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09746
Pdf URL: https://arxiv.org/pdf/2503.09746
Copy Paste: [[2503.09746]] Solving Bayesian inverse problems with diffusion priors and off-policy RL(https://arxiv.org/abs/2503.09746)
Keywords: diffusion
Abstract: This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

Title: BiasConnect: Investigating Bias Interactions in Text-to-Image Models

Authors: Pushkar Shukla, Aditya Chinchure, Emily Diana, Alexander Tolbert, Kartik Hosanagar, Vineeth N. Balasubramanian, Leonid Sigal, Matthew A. Turk
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09763
Pdf URL: https://arxiv.org/pdf/2503.09763
Copy Paste: [[2503.09763]] BiasConnect: Investigating Bias Interactions in Text-to-Image Models(https://arxiv.org/abs/2503.09763)
Keywords: generative
Abstract: The biases exhibited by Text-to-Image (TTI) models are often treated as if they are independent, but in reality, they may be deeply interrelated. Addressing bias along one dimension, such as ethnicity or age, can inadvertently influence another dimension, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. In this paper, we aim to address these questions by introducing BiasConnect, a novel tool designed to analyze and quantify bias interactions in TTI models. Our approach leverages a counterfactual-based framework to generate pairwise causal graphs that reveal the underlying structure of bias interactions for the given text prompt. Additionally, our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified. Our estimates have a strong correlation (+0.69) with the interdependency observations post bias mitigation. We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.

Title: Constrained Language Generation with Discrete Diffusion Models

Authors: Michael Cardei, Jacob K Christopher, Thomas Hartvigsen, Brian R. Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09790
Pdf URL: https://arxiv.org/pdf/2503.09790
Copy Paste: [[2503.09790]] Constrained Language Generation with Discrete Diffusion Models(https://arxiv.org/abs/2503.09790)
Keywords: diffusion
Abstract: Constraints are critical in text generation as LLM outputs are often unreliable when it comes to ensuring generated outputs adhere to user-defined instructions or general safety guidelines. To address this gap, we present Constrained Discrete Diffusion (CDD), a novel method for enforcing constraints on natural language by integrating discrete diffusion models with differentiable optimization. Unlike conventional text generators, which often rely on post-hoc filtering or model retraining for controllable generation, we propose imposing constraints directly into the discrete diffusion sampling process. We illustrate how this technique can be applied to satisfy a variety of natural language constraints, including (i) toxicity mitigation by preventing harmful content from emerging, (ii) character and sequence level lexical constraints, and (iii) novel molecule sequence generation with specific property adherence. Experimental results show that our constraint-aware procedure achieves high fidelity in meeting these requirements while preserving fluency and semantic coherence, outperforming auto-regressive and existing discrete diffusion approaches.

Title: Temporal Difference Flows

Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09817
Pdf URL: https://arxiv.org/pdf/2503.09817
Copy Paste: [[2503.09817]] Temporal Difference Flows(https://arxiv.org/abs/2503.09817)
Keywords: diffusion, foundation model, generative
Abstract: Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.

Title: Generative AI for Named Entity Recognition in Low-Resource Language Nepali

Authors: Sameer Neupane (University of Memphis), Jeevan Chapagain (University of Memphis), Nobal B. Niraula (Nowa Lab), Diwa Koirala (Nowa Lab)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09822
Pdf URL: https://arxiv.org/pdf/2503.09822
Copy Paste: [[2503.09822]] Generative AI for Named Entity Recognition in Low-Resource Language Nepali(https://arxiv.org/abs/2503.09822)
Keywords: generative
Abstract: Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), has significantly advanced Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), which involves identifying entities like person, location, and organization names in text. LLMs are especially promising for low-resource languages due to their ability to learn from limited data. However, the performance of GenAI models for Nepali, a low-resource language, has not been thoroughly evaluated. This paper investigates the application of state-of-the-art LLMs for Nepali NER, conducting experiments with various prompting techniques to assess their effectiveness. Our results provide insights into the challenges and opportunities of using LLMs for NER in low-resource settings and offer valuable contributions to the advancement of NLP research in languages like Nepali.

Title: Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning

Authors: Wenyi Lian, Joakim Lindblad, Patrick Micke, Nataša Sladoje
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09826
Pdf URL: https://arxiv.org/pdf/2503.09826
Copy Paste: [[2503.09826]] Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning(https://arxiv.org/abs/2503.09826)
Keywords: foundation model
Abstract: Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data.

Title: Resolution Invariant Autoencoder

Authors: Ashay Patel, Michela Antonelli, Sebastien Ourselin, M. Jorge Cardoso
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.09828
Pdf URL: https://arxiv.org/pdf/2503.09828
Copy Paste: [[2503.09828]] Resolution Invariant Autoencoder(https://arxiv.org/abs/2503.09828)
Keywords: generative
Abstract: Deep learning has significantly advanced medical imaging analysis, yet variations in image resolution remain an overlooked challenge. Most methods address this by resampling images, leading to either information loss or computational inefficiencies. While solutions exist for specific tasks, no unified approach has been proposed. We introduce a resolution-invariant autoencoder that adapts spatial resizing at each layer in the network via a learned variable resizing process, replacing fixed spatial down/upsampling at the traditional factor of 2. This ensures a consistent latent space resolution, regardless of input or output resolution. Our model enables various downstream tasks to be performed on an image latent whilst maintaining performance across different resolutions, overcoming the shortfalls of traditional methods. We demonstrate its effectiveness in uncertainty-aware super-resolution, classification, and generative modelling tasks and show how our method outperforms conventional baselines with minimal performance loss across resolutions.

Title: Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation

Authors: Feng Zhou, Pu Cao, Yiyang Ma, Lu Yang, Jianqin Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09830
Pdf URL: https://arxiv.org/pdf/2503.09830
Copy Paste: [[2503.09830]] Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation(https://arxiv.org/abs/2503.09830)
Keywords: diffusion, generative
Abstract: Denoising higher-resolution latents via a pre-trained U-Net leads to repetitive and disordered image patterns. Although recent studies make efforts to improve generative quality by aligning the denoising process across original and higher resolutions, the root cause of suboptimal generation remains underexplored. Through comprehensive analysis of position encoding in U-Net, we attribute it to inconsistent position encoding, caused by the inadequate propagation of position information from zero-padding to latent features in convolution layers as resolution increases. To address this issue, we propose a novel training-free approach, introducing a Progressive Boundary Complement (PBC) method. This method creates dynamic virtual image boundaries inside the feature map to enhance position information propagation, enabling high-quality and rich-content high-resolution image synthesis. Extensive experiments demonstrate the superiority of our method.

Title: Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis

Authors: Nahid Ul Islam, DongAo Ma, Jiaxuan Pang, Shivasakthi Senthil Velan, Michael Gotway, Jianming Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09860
Pdf URL: https://arxiv.org/pdf/2503.09860
Copy Paste: [[2503.09860]] Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis(https://arxiv.org/abs/2503.09860)
Keywords: foundation model
Abstract: Developing robust and versatile deep-learning models is essential for enhancing diagnostic accuracy and guiding clinical interventions in medical imaging, but it requires a large amount of annotated data. The advancement of deep learning has facilitated the creation of numerous medical datasets with diverse expert-level annotations. Aggregating these datasets can maximize data utilization and address the inadequacy of labeled data. However, the heterogeneity of expert-level annotations across tasks such as classification, localization, and segmentation presents a significant challenge for learning from these datasets. To this end, we introduce Foundation X, an end-to-end framework that utilizes diverse expert-level annotations from numerous public datasets to train a foundation model capable of multiple tasks including classification, localization, and segmentation. To address the challenges of annotation and task heterogeneity, we propose a Lock-Release pretraining strategy to enhance the cyclic learning from multiple datasets, combined with the student-teacher learning paradigm, ensuring the model retains general knowledge for all tasks while preventing overfitting to any single task. To demonstrate the effectiveness of Foundation X, we trained a model using 11 chest X-ray datasets, covering annotations for classification, localization, and segmentation tasks. Our experimental results show that Foundation X achieves notable performance gains through extensive annotation utilization, excels in cross-dataset and cross-task learning, and further enhances performance in organ localization and segmentation tasks. All code and pretrained models are publicly accessible at this https URL.

Title: Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval

Authors: Stefan Sylvius Wagner, Stefan Harmeling
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09867
Pdf URL: https://arxiv.org/pdf/2503.09867
Copy Paste: [[2503.09867]] Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval(https://arxiv.org/abs/2503.09867)
Keywords: self-supervised
Abstract: Object-centric learning is fundamental to human vision and crucial for models requiring complex reasoning. Traditional approaches rely on slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. However, DINO representations primarily capture global scene features, often confounding individual object attributes. We investigate the effectiveness of DINO representations and slot-based methods for multi-object instance retrieval. Our findings reveal that DINO representations excel at capturing global object attributes such as object shape and size, but struggle with object-level details like colour, whereas slot-based representations struggle at both global and object-level understanding. To address this, we propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors from a Variational Autoencoder trained on segmented image patches that are extracted from the DINO features. This approach improves multi-object instance retrieval performance, bridging the gap between global scene understanding and fine-grained object representation without requiring full model retraining.

Title: CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

Authors: Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.09878
Pdf URL: https://arxiv.org/pdf/2503.09878
Copy Paste: [[2503.09878]] CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation(https://arxiv.org/abs/2503.09878)
Keywords: self-supervised, foundation model
Abstract: Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multilayer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine-tuning on very small amounts of data, showing the effectiveness of our simple yet powerful KD strategy.

Title: Inter-environmental world modeling for continuous and compositional dynamics

Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09911
Pdf URL: https://arxiv.org/pdf/2503.09911
Copy Paste: [[2503.09911]] Inter-environmental world modeling for continuous and compositional dynamics(https://arxiv.org/abs/2503.09911)
Keywords: generative
Abstract: Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and object-centric autoencoder. On synthetic benchmark and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Title: Type Information-Assisted Self-Supervised Knowledge Graph Denoising

Authors: Jiaqi Sun, Yujia Zheng, Xinshuai Dong, Haoyue Dai, Kun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09916
Pdf URL: https://arxiv.org/pdf/2503.09916
Copy Paste: [[2503.09916]] Type Information-Assisted Self-Supervised Knowledge Graph Denoising(https://arxiv.org/abs/2503.09916)
Keywords: self-supervised
Abstract: Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting in a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize type inconsistency noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. Experimental validation demonstrates the effectiveness of our approach in detecting potential noise in real-world data.

Title: VideoMerge: Towards Training-free Long Video Generation

Authors: Siyang Zhang, Harry Yang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09926
Pdf URL: https://arxiv.org/pdf/2503.09926
Copy Paste: [[2503.09926]] VideoMerge: Towards Training-free Long Video Generation(https://arxiv.org/abs/2503.09926)
Keywords: diffusion
Abstract: Long video generation remains a challenging and compelling topic in computer vision. Diffusion based models, among the various approaches to video generation, have achieved state of the art quality with their iterative denoising procedures. However, the intrinsic complexity of the video domain renders the training of such diffusion models exceedingly expensive in terms of both data curation and computational resources. Moreover, these models typically operate on a fixed noise tensor that represents the video, resulting in predetermined spatial and temporal dimensions. Although several high quality open-source pretrained video diffusion models, jointly trained on images and videos of varying lengths and resolutions, are available, it is generally not recommended to specify a video length at inference that was not included in the training set. Consequently, these models are not readily adaptable to the direct generation of longer videos by merely increasing the specified video length. In addition to feasibility challenges, long-video generation also encounters quality issues. The domain of long videos is inherently more complex than that of short videos: extended durations introduce greater variability and necessitate long-range temporal consistency, thereby increasing the overall difficulty of the task. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos generated by a pretrained text-to-video diffusion model. Our approach preserves the model's original expressiveness and consistency while allowing for extended duration and dynamic variation as specified by the user. By leveraging the strengths of pretrained models, our method addresses challenges related to smoothness, consistency, and dynamic content through orthogonal strategies that operate collaboratively to achieve superior quality.

Title: PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation

Authors: Sen Wang, Dongliang Zhou, Liang Xie, Chao Xu, Ye Yan, Erwei Yin
Subjects: cs.CV, cs.MM, cs.RO
Abstract URL: https://arxiv.org/abs/2503.09938
Pdf URL: https://arxiv.org/pdf/2503.09938
Copy Paste: [[2503.09938]] PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation(https://arxiv.org/abs/2503.09938)
Keywords: diffusion
Abstract: Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents' learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
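
Note: the abstract names low-rank adaptation as its parameter-efficient fine-tuning technique. The following is a minimal pure-Python sketch of the general low-rank adaptation idea, not PanoGen++'s implementation: a frozen weight matrix `w` is left untouched and a trainable low-rank product of `a` and `b` is added, scaled by `alpha / r`. The function names, shapes, and the row-vector convention are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Return x @ w + (alpha / r) * (x @ a) @ b.

    x: (n, d_in) inputs, w: (d_in, d_out) frozen weights,
    a: (d_in, r) and b: (r, d_out) trainable low-rank factors.
    """
    r = len(b)  # rank = number of rows of b
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    scale = alpha / r
    return [[p + scale * q for p, q in zip(base_row, upd_row)]
            for base_row, upd_row in zip(base, update)]


def param_counts(d_in, d_out, r):
    """Return (full fine-tuning parameters, low-rank adapter parameters)."""
    return d_in * d_out, r * (d_in + d_out)
```

With `b` initialised to zeros the adapted layer reproduces the frozen layer exactly, and for a 1024 x 1024 layer at rank 8 the adapter holds 16,384 trainable values against 1,048,576 for the full matrix.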

Title: A Chaotic Image Encryption Scheme Using Novel Geometric Block Permutation and Dynamic Substitution

Authors: Muhammad Ali, Jawad Ahmad, Muhammad Abdullah Hussain Khan, Safee Ullah, Mujeeb Ur Rehman, Syed Aziz Shah, Muhammad Shahbaz Khan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.09939
Pdf URL: https://arxiv.org/pdf/2503.09939
Copy Paste: [[2503.09939]] A Chaotic Image Encryption Scheme Using Novel Geometric Block Permutation and Dynamic Substitution(https://arxiv.org/abs/2503.09939)
Keywords: diffusion
Abstract: In this digital era, ensuring the security of digital data during transmission and storage is crucial. Digital data, particularly image data, needs to be protected against unauthorized access. To address this, this paper presents a novel image encryption scheme based on a confusion diffusion architecture. The diffusion module introduces a novel geometric block permutation technique, which effectively scrambles the pixels based on geometric shape extraction of pixels. The image is converted into four blocks, and pixels are extracted from these blocks using L-shape, U-shape, square-shape, and inverted U-shape patterns for each block, respectively. This robust extraction and permutation effectively disrupts the correlation within the image. Furthermore, the confusion module utilises bit-XOR and dynamic substitution techniques. For the bit-XOR operation, a 2D Henon map has been utilised to generate a chaotic seed matrix, which is bit-XORed with the scrambled image. The resultant image then undergoes the dynamic substitution process to complete the confusion phase. A statistical security analysis demonstrates the superior security of the proposed scheme, with high uncertainty and unpredictability, achieving an entropy of 7.9974 and a correlation coefficient of 0.0014. These results validate the proposed scheme's effectiveness in securing digital images.
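
Note: the bit-XOR step described above can be illustrated with a small pure-Python sketch. It covers only the XOR of pixel bytes with a Henon-map keystream, not the geometric block permutation or the dynamic substitution. The parameters `a = 1.4` and `b = 0.3` are the classical Henon values; the initial state, the burn-in length, and the way floats are quantised to bytes are assumptions chosen for illustration and are not taken from the paper.

```python
def henon_keystream(n, x0=0.1, y0=0.1, a=1.4, b=0.3, burn_in=100):
    """Generate n pseudo-random bytes by iterating the 2D Henon map."""
    x, y = x0, y0
    stream = []
    for i in range(burn_in + n):
        # Henon map: x' = 1 - a*x^2 + y, y' = b*x
        x, y = 1.0 - a * x * x + y, b * x
        if i >= burn_in:
            # Hypothetical quantisation: keep low-order digits of |x| as a byte.
            stream.append(int(abs(x) * 1e6) % 256)
    return stream


def xor_pixels(pixels, **key):
    """XOR a flat list of 8-bit pixel values with the Henon keystream."""
    stream = henon_keystream(len(pixels), **key)
    return [p ^ k for p, k in zip(pixels, stream)]
```

Because XOR is its own inverse, calling `xor_pixels` a second time with the same initial state recovers the original pixels, which is why the initial state acts as the key in this sketch.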

Title: Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

Authors: Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09942
Pdf URL: https://arxiv.org/pdf/2503.09942
Copy Paste: [[2503.09942]] Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers(https://arxiv.org/abs/2503.09942)
Keywords: diffusion
Abstract: Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.

Title: UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Authors: Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09949
Pdf URL: https://arxiv.org/pdf/2503.09949
Copy Paste: [[2503.09949]] UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?(https://arxiv.org/abs/2503.09949)
Keywords: generative
Abstract: With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at this https URL.

Title: X-Cross: Image Encryption Featuring Novel Dual-Layer Block Permutation and Dynamic Substitution Techniques

Authors: Hansa Ahsan, Safee Ullah, Jawad Ahmad, Aizaz Ahmad Khattak, Muhammad Ali, Muhammad Shahbaz Khan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.09953
Pdf URL: https://arxiv.org/pdf/2503.09953
Copy Paste: [[2503.09953]] X-Cross: Image Encryption Featuring Novel Dual-Layer Block Permutation and Dynamic Substitution Techniques(https://arxiv.org/abs/2503.09953)
Keywords: diffusion
Abstract: In this digital age, ensuring the security of digital data, especially image data, is critically important. Image encryption plays an important role in securing the online transmission/storage of images from unauthorized access. In this regard, this paper presents a novel diffusion-confusion-based image encryption algorithm named X-CROSS. The diffusion phase involves a dual-layer block permutation. It involves a bit-level permutation termed Inter-Bit Transference (IBT) using a Bit-Extraction key, and pixel permutation with a unique X-cross permutation algorithm to effectively scramble the pixels within an image. The proposed algorithm utilizes a resilient 2D chaotic map with non-linear dynamical behavior, assisting in generating complex Extraction Keys. After the permutation phase, the confusion phase proceeds with a dynamic substitution technique on the permuted images, establishing the final encryption layer. This combination of novel permutation and confusion results in the removal of the image's inherent patterns and increases its resistance to cyber-attacks. The close to ideal statistical security results for information entropy, correlation, homogeneity, contrast, and energy validate the proposed scheme's effectiveness in hiding the information within the image.

Title: Take Off the Training Wheels Progressive In-Context Learning for Effective Alignment

Authors: Zhenyu Liu, Dongfang Li, Xinshuo Hu, Xinping Zhao, Yibin Chen, Baotian Hu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.09958
Pdf URL: https://arxiv.org/pdf/2503.09958
Copy Paste: [[2503.09958]] Take Off the Training Wheels Progressive In-Context Learning for Effective Alignment(https://arxiv.org/abs/2503.09958)
Keywords: in-context
Abstract: Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within the practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become redundant. Motivated by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further demonstrations. Extensive experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45+) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at this https URL.

Title: Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes

Authors: JunYong Choi, Min-Cheol Sagong, SeokYeong Lee, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09993
Pdf URL: https://arxiv.org/pdf/2503.09993
Copy Paste: [[2503.09993]] Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes(https://arxiv.org/abs/2503.09993)
Keywords: diffusion, generative
Abstract: We propose a diffusion-based inverse rendering framework that decomposes a single RGB image into geometry, material, and lighting. Inverse rendering is inherently ill-posed, making it difficult to predict a single accurate solution. To address this challenge, recent generative model-based methods aim to present a range of possible solutions. However, finding a single accurate solution and generating diverse solutions can be conflicting. In this paper, we propose a channel-wise noise scheduling approach that allows a single diffusion model architecture to achieve two conflicting objectives. The resulting two diffusion models, trained with different channel-wise noise schedules, can predict a single highly accurate solution and present multiple possible solutions. The experimental results demonstrate the superiority of our two models in terms of both diversity and accuracy, which translates to enhanced performance in downstream applications such as object insertion and material editing.
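
Note: to make the idea of a channel-wise noise schedule concrete, here is a minimal pure-Python sketch of a forward diffusion step in which each channel gets its own cumulative signal level. The cosine curve and the per-channel exponent `power` are hypothetical choices for illustration; the abstract does not specify the schedules the authors use.

```python
import math
import random


def alpha_bar(t, power):
    """Cumulative signal level at time t in [0, 1].

    A cosine curve warped by a per-channel exponent (an assumed form).
    """
    return math.cos(0.5 * math.pi * t ** power) ** 2


def noise_channels(x0, t, powers, rng):
    """Apply x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps independently per channel.

    x0: list of channels, each a flat list of floats.
    powers: one schedule exponent per channel.
    """
    noisy = []
    for channel, power in zip(x0, powers):
        ab = alpha_bar(t, power)
        noisy.append([math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
                      for v in channel])
    return noisy
```

In this sketch a channel with a larger exponent keeps more signal at the same timestep, so different output channels (for example geometry versus material maps) could be noised at different rates within one model architecture.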

Title: Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models

Authors: Sina Malakouti, Adriana Kovashka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10037
Pdf URL: https://arxiv.org/pdf/2503.10037
Copy Paste: [[2503.10037]] Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.10037)
Keywords: diffusion
Abstract: Text-to-image diffusion models consistently fail at generating counter-stereotypical action relationships (e.g., "mouse chasing cat"), defaulting to frequent stereotypes even when explicitly prompted otherwise. Through systematic investigation, we discover this limitation stems from distributional biases rather than inherent model constraints. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"). To test this hypothesis, we develop a Role-Bridging Decomposition framework that leverages these intermediates to gradually teach rare relationships without architectural modifications. We introduce ActionBench, a comprehensive benchmark specifically designed to evaluate action-based relationship generation across stereotypical and counter-stereotypical configurations. Our experiments validate that intermediate compositions indeed facilitate counter-stereotypical generation, with both automatic metrics and human evaluations showing significant improvements over existing approaches. This work not only identifies fundamental biases in current text-to-image systems but demonstrates a promising direction for addressing them through compositional reasoning.

Title: Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations

Authors: Ho Hin Lee, Alberto Santamaria-Pang, Jameson Merkov, Matthew Lungren, Ivan Tarapov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10057
Pdf URL: https://arxiv.org/pdf/2503.10057
Copy Paste: [[2503.10057]] Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations(https://arxiv.org/abs/2503.10057)
Keywords: foundation model
Abstract: Accurate survival prediction in oncology requires integrating diverse imaging modalities to capture the complex interplay of tumor biology. Traditional single-modality approaches often fail to leverage the complementary insights provided by radiological and pathological assessments. In this work, we introduce M4Survive (Multi-Modal Mamba Modeling for Survival Prediction), a novel framework that learns joint foundation model representations using efficient adapter networks. Our approach dynamically fuses heterogeneous embeddings from a foundation model repository (e.g., MedImageInsight, BiomedCLIP, Prov-GigaPath, UNI2-h), creating a correlated latent space optimized for survival risk estimation. By leveraging Mamba-based adapters, M4Survive enables efficient multi-modal learning while preserving computational efficiency. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms both unimodal and traditional static multi-modal baselines in survival prediction accuracy. This work underscores the potential of foundation model-driven multi-modal fusion in advancing precision oncology and predictive analytics.

Title: Provably Secure Covert Messaging Using Image-based Diffusion Processes

Authors: Luke A. Bauer, Wenxuan Bao, Vincent Bindschaedler
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10063
Pdf URL: https://arxiv.org/pdf/2503.10063
Copy Paste: [[2503.10063]] Provably Secure Covert Messaging Using Image-based Diffusion Processes(https://arxiv.org/abs/2503.10063)
Keywords: diffusion
Abstract: We consider the problem of securely and robustly embedding covert messages into an image-based diffusion model's output. The sender and receiver want to exchange the maximum amount of information possible per diffusion sampled image while remaining undetected. The adversary wants to detect that such communication is taking place by identifying those diffusion samples that contain covert messages. To maximize robustness to transformations of the diffusion sample, a strategy is for the sender and the receiver to embed the message in the initial latents. We first show that prior work that attempted this is easily broken because their embedding technique alters the latents' distribution. We then propose a straightforward method to embed covert messages in the initial latent without altering the distribution. We prove that our construction achieves indistinguishability to any probabilistic polynomial time adversary. Finally, we discuss and analyze empirically the tradeoffs between embedding capacity, message recovery rates, and robustness. We find that optimizing the inversion method for error correction is crucial for reliability.

Title: Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection

Authors: Zhen Qu, Xian Tao, Xinyi Gong, Shichen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10080
Pdf URL: https://arxiv.org/pdf/2503.10080
Copy Paste: [[2503.10080]] Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection(https://arxiv.org/abs/2503.10080)
Keywords: anomaly
Abstract: Recently, vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limits generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and enhance the model's generalization on unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-attention (RCA) module is introduced to better align dynamic text embeddings with fine-grained image features. Extensive experiments on 15 industrial and medical datasets demonstrate our method's superior performance.

Title: AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption

Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-eui Yoon
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10081
Pdf URL: https://arxiv.org/pdf/2503.10081
Copy Paste: [[2503.10081]] AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption(https://arxiv.org/abs/2503.10081)
Keywords: diffusion, generative
Abstract: The outstanding capability of diffusion models in generating high-quality images poses significant threats when misused by adversaries. In particular, we assume malicious adversaries exploiting diffusion models for inpainting tasks, such as replacing a specific region with a celebrity. While existing methods for protecting images from manipulation in diffusion-based generative models have primarily focused on image-to-image and text-to-image tasks, the challenge of preventing unauthorized inpainting has been rarely addressed, often resulting in suboptimal protection performance. To mitigate inpainting abuses, we propose ADVPAINT, a novel defensive framework that generates adversarial perturbations that effectively disrupt the adversary's inpainting tasks. ADVPAINT targets the self- and cross-attention blocks in a target diffusion inpainting model to distract semantic understanding and prompt interactions during image generation. ADVPAINT also employs a two-stage perturbation strategy, dividing the perturbation region based on an enlarged bounding box around the object, enhancing robustness across diverse masks of varying shapes and sizes. Our experimental results demonstrate that ADVPAINT's perturbations are highly effective in disrupting the adversary's inpainting tasks, outperforming existing methods; ADVPAINT attains over a 100-point increase in FID and substantial decreases in precision.

Title: Semantic Latent Motion for Portrait Video Generation

Authors: Qiyuan Zhang, Chenyu Wu, Wenzhang Sun, Huaize Liu, Donglin Di, Wei Chen, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10096
Pdf URL: https://arxiv.org/pdf/2503.10096
Copy Paste: [[2503.10096]] Semantic Latent Motion for Portrait Video Generation(https://arxiv.org/abs/2503.10096)
Keywords: self-supervised, generative
Abstract: Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generation models, which may introduce unrealistic motion and lead to inefficient inference. To address these challenges, we propose Semantic Latent Motion (SeMo), a compact and expressive motion representation. Leveraging this representation, our approach achieves both high-quality visual results and efficient inference. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. First, in the Abstraction step, we use a carefully designed Mask Motion Encoder to compress the subject's motion state into a compact and abstract latent motion (1D token). Second, in the Reasoning step, long-term modeling and efficient reasoning are performed in this latent space to generate motion sequences. Finally, in the Generation step, the motion dynamics serve as conditional information to guide the generation model in synthesizing realistic transitions from reference frames to target frames. Thanks to the compact and descriptive nature of Semantic Latent Motion, our method enables real-time video generation with highly realistic motion. User studies demonstrate that our approach surpasses state-of-the-art models with an 81% win rate in realism. Extensive experiments further highlight its strong compression capability, reconstruction quality, and generative potential. Moreover, its fully self-supervised nature suggests promising applications in broader video generation tasks.

Title: Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation

Authors: Jiawei Zhang, Ziyuan Liu, Leon Yan, Gen Li, Yuantao Gu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10103
Pdf URL: https://arxiv.org/pdf/2503.10103
Copy Paste: [[2503.10103]] Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation(https://arxiv.org/abs/2503.10103)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, and performance degrades with fewer steps, which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Code for reproducing our experiments is available at this https URL_inverse_problem.

Title: MoEdit: On Learning Quantity Perception for Multi-object Image Editing

Authors: Yanfeng Li, Kahou Chan, Yue Sun, Chantong Lam, Tong Tong, Zitong Yu, Keren Fu, Xiaohong Liu, Tao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10112
Pdf URL: https://arxiv.org/pdf/2503.10112
Copy Paste: [[2503.10112]] MoEdit: On Learning Quantity Perception for Multi-object Image Editing(https://arxiv.org/abs/2503.10112)
Keywords: diffusion
Abstract: Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and as part of the whole image, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency by effective control in editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and code will be available at this https URL.

Title: Hybrid Agents for Image Restoration

Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10120
Pdf URL: https://arxiv.org/pdf/2503.10120
Copy Paste: [[2503.10120]] Hybrid Agents for Image Restoration(https://arxiv.org/abs/2503.10120)
Keywords: in-context
Abstract: Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking the cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for non-expert users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but has not been considered in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent with both synthetic and real-world IR tasks.

Title: Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation

Authors: Yi Wu, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10125
Pdf URL: https://arxiv.org/pdf/2503.10125
Copy Paste: [[2503.10125]] Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation(https://arxiv.org/abs/2503.10125)
Keywords: diffusion
Abstract: Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.

Title: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Authors: Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10127
Pdf URL: https://arxiv.org/pdf/2503.10127
Copy Paste: [[2503.10127]] PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models(https://arxiv.org/abs/2503.10127)
Keywords: diffusion
Abstract: In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multi-task training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. Code is available at: this https URL.

Title: Robustness Tokens: Towards Adversarial Robustness of Transformers

Authors: Brian Pulfer, Yury Belousov, Slava Voloshynovskiy
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10191
Pdf URL: https://arxiv.org/pdf/2503.10191
Copy Paste: [[2503.10191]] Robustness Tokens: Towards Adversarial Robustness of Transformers(https://arxiv.org/abs/2503.10191)
Keywords: foundation model
Abstract: Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on their use as backbone models for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performances.

Title: Singular Value Fine-tuning for Few-Shot Class-Incremental Learning

Authors: Zhiwu Wang, Yichen Wu, Renzhen Wang, Haokun Lin, Quanziang Wang, Qian Zhao, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10214
Pdf URL: https://arxiv.org/pdf/2503.10214
Copy Paste: [[2503.10214]] Singular Value Fine-tuning for Few-Shot Class-Incremental Learning(https://arxiv.org/abs/2503.10214)
Keywords: foundation model
Abstract: Class-Incremental Learning (CIL) aims to prevent catastrophic forgetting of previously learned classes while sequentially incorporating new ones. The more challenging Few-shot CIL (FSCIL) setting further complicates this by providing only a limited number of samples for each new class, increasing the risk of overfitting in addition to standard CIL challenges. While catastrophic forgetting has been extensively studied, overfitting in FSCIL, especially with large foundation models, has received less attention. To fill this gap, we propose the Singular Value Fine-tuning for FSCIL (SVFCL) and compare it with existing approaches for adapting foundation models to FSCIL, which primarily build on Parameter Efficient Fine-Tuning (PEFT) methods like prompt tuning and Low-Rank Adaptation (LoRA). Specifically, SVFCL applies singular value decomposition to the foundation model weights, keeping the singular vectors fixed while fine-tuning the singular values for each task, and then merging them. This simple yet effective approach not only alleviates the forgetting problem but also mitigates overfitting more effectively while significantly reducing trainable parameters. Extensive experiments on four benchmark datasets, along with visualizations and ablation studies, validate the effectiveness of SVFCL. The code will be made available.
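
Illustrative sketch (not from the paper): the abstract above describes decomposing a weight matrix with SVD, freezing the singular vectors, and adjusting only the singular values per task before merging. The NumPy snippet below shows that idea on a toy matrix. The function names, the random per-task deltas, and the merge rule (summing the per-task deltas) are assumptions made for illustration; the abstract does not specify how SVFCL merges tasks.

```python
import numpy as np

def svd_factors(weight):
    """Decompose a weight matrix once; U and Vt are treated as frozen afterwards."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt

def merge_singular_values(u, s, vt, task_deltas):
    """Rebuild the weight from frozen singular vectors and adjusted singular values.

    Summing the per-task deltas is an assumed merge rule, used here for illustration only.
    """
    s_merged = s + np.sum(task_deltas, axis=0)
    return (u * s_merged) @ vt  # scales column i of U by s_merged[i]

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # stand-in for one pretrained weight matrix
U, S, Vt = svd_factors(W)

# Hypothetical per-task updates: only len(S) values would be trainable per task.
deltas = np.stack([0.1 * rng.normal(size=S.shape) for _ in range(3)])
W_merged = merge_singular_values(U, S, Vt, deltas)

# With all-zero deltas the original weight is recovered.
W_same = merge_singular_values(U, S, Vt, np.zeros_like(deltas))
```

In this toy setup each task touches 4 values rather than the 24 entries of W, which is the kind of parameter reduction the abstract refers to.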

Title: CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

Authors: Kaixiang Yang, Xin Li, Qiang Li, Zhiwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10216
Pdf URL: https://arxiv.org/pdf/2503.10216
Copy Paste: [[2503.10216]] CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition(https://arxiv.org/abs/2503.10216)
Keywords: diffusion
Abstract: Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world this http URL this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument this http URL, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing this http URL on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weight are available at this https URL.

Title: Probability-Flow ODE in Infinite-Dimensional Function Spaces

Authors: Kunwoo Na, Junghyun Lee, Se-Young Yun, Sungbin Lim
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.10219
Pdf URL: https://arxiv.org/pdf/2503.10219
Copy Paste: [[2503.10219]] Probability-Flow ODE in Infinite-Dimensional Function Spaces(https://arxiv.org/abs/2503.10219)
Keywords: diffusion
Abstract: Recent advances in infinite-dimensional diffusion models have demonstrated their effectiveness and scalability in function generation tasks where the underlying structure is inherently infinite-dimensional. To accelerate inference in such models, we derive, for the first time, an analog of the probability-flow ODE (PF-ODE) in infinite-dimensional function spaces. Leveraging this newly formulated PF-ODE, we reduce the number of function evaluations while maintaining sample quality in function generation tasks, including applications to PDEs.
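
For orientation (background, not the paper's result): the standard finite-dimensional probability-flow ODE that the abstract above generalizes can be written as follows. The symbols $f$, $g$, $W_t$ and $p_t$ are the usual drift, diffusion coefficient, Wiener process and marginal density of a score-based diffusion model; the infinite-dimensional analog derived in the paper is not reproduced here.

```latex
% Forward SDE with marginal densities p_t
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t
% Probability-flow ODE sharing the same marginals p_t
\frac{\mathrm{d}x}{\mathrm{d}t} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)
```

Because the ODE is deterministic, it can be integrated with larger steps than the stochastic sampler, which is where the reduction in function evaluations comes from.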

Title: R.U.Psycho? Robust Unified Psychometric Testing of Language Models

Authors: Julian Schelb, Orr Borin, David Garcia, Andreas Spitz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10229
Pdf URL: https://arxiv.org/pdf/2503.10229
Copy Paste: [[2503.10229]] R.U.Psycho? Robust Unified Psychometric Testing of Language Models(https://arxiv.org/abs/2503.10229)
Keywords: generative
Abstract: Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present R.U.Psycho, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. R.U.Psycho is available as a Python package at this https URL.

Title: SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

Authors: Zhi Chen, Zecheng Zhao, Jingcai Guo, Jingjing Li, Zi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10252
Pdf URL: https://arxiv.org/pdf/2503.10252
Copy Paste: [[2503.10252]] SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning(https://arxiv.org/abs/2503.10252)
Keywords: self-supervised
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduces ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with the supervision from aggregated attention scores across all transformer layers, which estimate each patch's semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations.

Title: An Open-RAN Testbed for Detecting and Mitigating Radio-Access Anomalies

Authors: Hanna Bogucka, Marcin Hoffmann, Paweł Kryszkiewicz, Łukasz Kułacz
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10255
Pdf URL: https://arxiv.org/pdf/2503.10255
Copy Paste: [[2503.10255]] An Open-RAN Testbed for Detecting and Mitigating Radio-Access Anomalies(https://arxiv.org/abs/2503.10255)
Keywords: anomaly
Abstract: This paper presents the Open Radio Access Network (O-RAN) testbed for secure radio access. We discuss radio-originating attack detection and mitigation methods based on anomaly detection and how they can be implemented as specialized applications (xApps) in this testbed. We also present illustrative results of the methods applied in real-world scenarios and implementations.

Title: ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Authors: Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10256
Pdf URL: https://arxiv.org/pdf/2503.10256
Copy Paste: [[2503.10256]] ROODI: Reconstructing Occluded Objects with Denoising Inpainters(https://arxiv.org/abs/2503.10256)
Keywords: diffusion, generative
Abstract: While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we analyze the local structure of primitives using K-nearest neighbors, and retain only relevant ones. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.

Title: MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

Authors: Zebin He, Mingxin Yang, Shuhui Yang, Yixuan Tang, Tao Wang, Kaihao Zhang, Guanying Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10289
Pdf URL: https://arxiv.org/pdf/2503.10289
Copy Paste: [[2503.10289]] MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion(https://arxiv.org/abs/2503.10289)
Keywords: diffusion
Abstract: Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.

Title: Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings

Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10333
Pdf URL: https://arxiv.org/pdf/2503.10333
Copy Paste: [[2503.10333]] Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings(https://arxiv.org/abs/2503.10333)
Keywords: generative
Abstract: In dynamic environments where new concepts continuously emerge, Deep Neural Networks (DNNs) must adapt by learning new classes while retaining previously acquired ones. This challenge is addressed by Class-Incremental Learning (CIL). This paper introduces Generative Binary Memory (GBM), a novel CIL pseudo-replay approach which generates synthetic binary pseudo-exemplars. Relying on Bernoulli Mixture Models (BMMs), GBM effectively models the multi-modal characteristics of class distributions, in a latent, binary space. With a specifically-designed feature binarizer, our approach applies to any conventional DNN. GBM also natively supports Binary Neural Networks (BNNs) for highly-constrained model sizes in embedded systems. The experimental results demonstrate that GBM achieves higher than state-of-the-art average accuracy on CIFAR100 (+2.9%) and TinyImageNet (+1.5%) for a ResNet-18 equipped with our binarizer. GBM also outperforms emerging CIL methods for BNNs, with +3.1% in final accuracy and x4.7 memory reduction, on CORE50.
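
Illustrative sketch (not from the paper): the abstract above relies on Bernoulli Mixture Models to generate binary pseudo-exemplars. The snippet below shows only the generic sampling step of a Bernoulli mixture: pick a component, then draw each bit independently. The function name, the two-component toy parameters, and the 4-bit embedding size are assumptions; how GBM fits the mixture and binarizes features is not shown.

```python
import numpy as np

def sample_bmm(weights, means, n_samples, rng):
    """Draw binary vectors from a Bernoulli mixture.

    weights: (K,) mixing proportions that sum to 1.
    means:   (K, D) per-component Bernoulli probabilities for each bit.
    """
    # Choose one mixture component per sample.
    components = rng.choice(len(weights), size=n_samples, p=weights)
    probs = means[components]  # (n_samples, D)
    # Each bit is 1 with the probability given by its component.
    samples = (rng.random(probs.shape) < probs).astype(np.uint8)
    return samples, components

rng = np.random.default_rng(0)
weights = np.array([0.7, 0.3])                # hypothetical two-mode class
means = np.array([[0.9, 0.9, 0.1, 0.1],
                  [0.1, 0.1, 0.9, 0.9]])
pseudo_exemplars, comp = sample_bmm(weights, means, 1000, rng)
```

Storing only `weights` and `means` per class, instead of real exemplars, is what makes this kind of replay memory compact.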

Title: DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image

Authors: Qi Zhao, Zhan Ma, Pan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10342
Pdf URL: https://arxiv.org/pdf/2503.10342
Copy Paste: [[2503.10342]] DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image(https://arxiv.org/abs/2503.10342)
Keywords: diffusion, generative
Abstract: Recent developments in generative diffusion models have turned many dreams into realities. For video object insertion, existing methods typically require additional information, such as a reference video or a 3D asset of the object, to generate the synthetic motion. However, inserting an object from a single reference photo into a target background video remains an uncharted area due to the lack of unseen motion information. We propose DreamInsert, which achieves Image-to-Video Object Insertion in a training-free manner for the first time. By taking the trajectory of the object into consideration, DreamInsert can predict the unseen object movement, fuse it harmoniously with the background video, and generate the desired video seamlessly. More significantly, DreamInsert is both simple and effective, achieving zero-shot insertion without end-to-end training or additional fine-tuning on well-designed image-video data pairs. We demonstrate the effectiveness of DreamInsert through a variety of experiments. Leveraging this capability, we present the first results for Image-to-Video object insertion in a training-free manner, paving exciting new directions for future content creation and synthesis. The code will be released soon.

Title: Enhancing Facial Privacy Protection via Weakening Diffusion Purification

Authors: Ali Salar, Qing Liu, Yingli Tian, Guoying Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10350
Pdf URL: https://arxiv.org/pdf/2503.10350
Copy Paste: [[2503.10350]] Enhancing Facial Privacy Protection via Weakening Diffusion Purification(https://arxiv.org/abs/2503.10350)
Keywords: diffusion
Abstract: The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance.

Title: ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation

Authors: Zirun Guo, Tao Jin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10358
Pdf URL: https://arxiv.org/pdf/2503.10358
Copy Paste: [[2503.10358]] ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation(https://arxiv.org/abs/2503.10358)
Keywords: diffusion
Abstract: Diffusion customization methods have achieved impressive results with only a minimal number of user-provided images. However, existing approaches customize concepts collectively, whereas real-world applications often require sequential concept integration. This sequential nature can lead to catastrophic forgetting, where previously learned concepts are lost. In this paper, we investigate concept forgetting and concept confusion in the continual customization. To tackle these challenges, we present ConceptGuard, a comprehensive approach that combines shift embedding, concept-binding prompts and memory preservation regularization, supplemented by a priority queue which can adaptively update the importance and occurrence order of different concepts. These strategies can dynamically update, unbind and learn the relationship of the previous concepts, thus alleviating concept forgetting and confusion. Through comprehensive experiments, we show that our approach outperforms all the baseline methods consistently and significantly in both quantitative and qualitative analyses.

Title: Piece it Together: Part-Based Concepting with IP-Priors

Authors: Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10365
Pdf URL: https://arxiv.org/pdf/2503.10365
Copy Paste: [[2503.10365]] Piece it Together: Part-Based Concepting with IP-Priors(https://arxiv.org/abs/2503.10365)
Keywords: generative
Abstract: Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.

Title: Probabilistic Forecasting via Autoregressive Flow Matching

Authors: Ahmed El-Gazzar, Marcel van Gerven
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10375
Pdf URL: https://arxiv.org/pdf/2503.10375
Copy Paste: [[2503.10375]] Probabilistic Forecasting via Autoregressive Flow Matching(https://arxiv.org/abs/2503.10375)
Keywords: generative
Abstract: In this work, we propose FlowTime, a generative model for probabilistic forecasting of multivariate timeseries data. Given historical measurements and optional future covariates, we formulate forecasting as sampling from a learned conditional distribution over future trajectories. Specifically, we decompose the joint distribution of future observations into a sequence of conditional densities, each modeled via a shared flow that transforms a simple base distribution into the next observation distribution, conditioned on observed covariates. To achieve this, we leverage the flow matching (FM) framework, enabling scalable and simulation-free learning of these transformations. By combining this factorization with the FM objective, FlowTime retains the benefits of autoregressive models -- including strong extrapolation performance, compact model size, and well-calibrated uncertainty estimates -- while also capturing complex multi-modal conditional distributions, as seen in modern transport-based generative models. We demonstrate the effectiveness of FlowTime on multiple dynamical systems and real-world forecasting tasks.
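
For orientation (notation assumed, not taken from the paper): the autoregressive decomposition described in the abstract above can be written with the chain rule of probability. Here $x_{1:t}$ denotes the observed history, $H$ the forecast horizon, and $c$ the covariates; these symbols are illustrative choices. According to the abstract, each factor on the right is modeled by a shared flow trained with the flow matching objective.

```latex
p\left(x_{t+1:t+H} \mid x_{1:t},\, c\right)
  = \prod_{h=1}^{H} p\left(x_{t+h} \mid x_{1:t+h-1},\, c\right)
```

Sampling a trajectory then amounts to drawing $x_{t+1}$, appending it to the history, and repeating for each of the $H$ steps.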

Title: CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

Authors: Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10391
Pdf URL: https://arxiv.org/pdf/2503.10391
Copy Paste: [[2503.10391]] CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance(https://arxiv.org/abs/2503.10391)
Keywords: diffusion, generative
Abstract: Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging an MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.

Title: RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

Authors: Fengxiang Wang, Hongzhen Wang, Yulin Wang, Di Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Long Lan, Wenjing Yang, Jing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10392
Pdf URL: https://arxiv.org/pdf/2503.10392
Copy Paste: [[2503.10392]] RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing(https://arxiv.org/abs/2503.10392)
Keywords: self-supervised, foundation model
Abstract: Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at this https URL.

Title: Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders

Authors: Jingyu Guo, Sensen Gao, Jia-Wang Bian, Wanhu Sun, Heliang Zheng, Rongfei Jia, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10403
Pdf URL: https://arxiv.org/pdf/2503.10403
Copy Paste: [[2503.10403]] Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders(https://arxiv.org/abs/2503.10403)
Keywords: diffusion
Abstract: Recent 3D content generation pipelines often leverage Variational Autoencoders (VAEs) to encode shapes into compact latent representations, facilitating diffusion-based generation. Efficiently compressing 3D shapes while preserving intricate geometric details remains a key challenge. Existing 3D shape VAEs often employ uniform point sampling and 1D/2D latent representations, such as vector sets or triplanes, leading to significant geometric detail loss due to inadequate surface coverage and the absence of explicit 3D representations in the latent space. Although recent work explores 3D latent representations, their large scale hinders high-resolution encoding and efficient training. Given these challenges, we introduce Hyper3D, which enhances VAE reconstruction through efficient 3D representation that integrates hybrid triplane and octree features. First, we adopt an octree-based feature representation to embed mesh information into the network, mitigating the limitations of uniform point sampling in capturing geometric distributions along the mesh surface. Furthermore, we propose a hybrid latent space representation that integrates a high-resolution triplane with a low-resolution 3D grid. This design not only compensates for the lack of explicit 3D representations but also leverages a triplane to preserve high-resolution details. Experimental results demonstrate that Hyper3D outperforms traditional representations by reconstructing 3D shapes with higher fidelity and finer details, making it well-suited for 3D generation pipelines.

Title: RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Authors: Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10406
Pdf URL: https://arxiv.org/pdf/2503.10406
Copy Paste: [[2503.10406]] RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models(https://arxiv.org/abs/2503.10406)
Keywords: in-context
Abstract: Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task. Project page: this https URL

Title: Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning

Authors: Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10408
Pdf URL: https://arxiv.org/pdf/2503.10408
Copy Paste: [[2503.10408]] Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning(https://arxiv.org/abs/2503.10408)
Keywords: in-context
Abstract: We study the capabilities of Large Language Models (LLMs) on binary relations, a ubiquitous concept in math employed in most reasoning, math and logic benchmarks. This work focuses on equality, inequality, and inclusion, along with the properties they satisfy, such as ir/reflexivity, a/symmetry, transitivity, and logical complexity (e.g., number of reasoning "hops"). We propose an alternative to in-context learning that trains only the representations of newly introduced tokens, namely out-of-context representation learning. This method mitigates linguistic biases already present in a model and, unlike in-context learning, does not rely on external information or illustrations. We argue that out-of-context representation learning is a better alternative to in-context learning and fine-tuning for evaluating the capabilities of LLMs on logic tasks that are the building blocks of more complex reasoning benchmarks.

Title: Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Authors: Evgeniia Vu, Andrei Boiarov, Dmitry Vetrov
Subjects: cs.LG, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.10488
Pdf URL: https://arxiv.org/pdf/2503.10488
Copy Paste: [[2503.10488]] Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion(https://arxiv.org/abs/2503.10488)
Keywords: diffusion
Abstract: Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation that extends rolling diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup with high visual fidelity and temporal coherence. We evaluate our approach on ZEGGS and BEAT, strong benchmarks for real-world applicability. Our framework is universally applicable to any diffusion-based gesture generation model, transforming it into a streaming approach. Applied to three state-of-the-art methods, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time, high-fidelity co-speech gesture synthesis.

Title: Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction

Authors: Yuhan Wang, Cheng Liu, Daou Zhang, Weichao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10508
Pdf URL: https://arxiv.org/pdf/2503.10508
Copy Paste: [[2503.10508]] Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction(https://arxiv.org/abs/2503.10508)
Keywords: generative, anomaly
Abstract: In the domain of Image Anomaly Detection (IAD), existing methods frequently lack fine-grained, interpretable semantic information. This deficiency often leads to the detection of anomalous entities or actions that are susceptible to machine illusions and lack sufficient explanation. In this work, we propose a novel approach to anomaly detection, termed Hoi2Anomaly, which aims to achieve precise discrimination and localization of anomalies. First, we construct a multi-modal instruction tuning dataset comprising human-object interaction (HOI) pairs in anomalous scenarios. Second, we train an HOI extractor in threat scenarios to localize and match anomalous actions and entities. Finally, explanatory content is generated for the detected anomalous HOI by fine-tuning the visual language pretraining (VLP) framework. The experimental results demonstrate that Hoi2Anomaly surpasses existing generative approaches in terms of precision and explainability. We will release Hoi2Anomaly for the advancement of the field of anomaly detection.

Title: Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression

Authors: Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, Janaradhan Rao Doppa
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10512
Pdf URL: https://arxiv.org/pdf/2503.10512
Copy Paste: [[2503.10512]] Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression(https://arxiv.org/abs/2503.10512)
Keywords: generative
Abstract: We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, a code generation application may require at least one program in the set to pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.
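
The idea of conformal regression over the minimum number of samples can be illustrated with a generic one-sided split conformal sketch. This is not the authors' GPS algorithm: the calibration pairs, the point predictor behind `pred_k`, and the function name `conformal_upper_offset` are all hypothetical, and only the standard finite-sample quantile rule is used.

```python
import math

def conformal_upper_offset(scores, alpha):
    """One-sided split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

# Hypothetical calibration data: (predicted, observed) minimum number of samples
# needed before an admissible output appeared for each calibration prompt.
calibration = [(2, 3), (1, 1), (4, 3), (2, 5), (3, 3), (1, 2), (5, 4), (2, 2), (3, 6)]

# Nonconformity score: how far the observed count exceeded the prediction.
scores = [true_k - pred_k for pred_k, true_k in calibration]
offset = conformal_upper_offset(scores, alpha=0.25)

# For a new prompt whose (hypothetical) predictor says 3 samples, draw this many.
budget = max(1, 3 + offset)
print(offset, budget)
```

Under exchangeability of calibration and test prompts, drawing `budget` samples covers the true minimum count with probability at least `1 - alpha`; how GPS builds its predictor and scores is not specified in the abstract.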

Title: PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

Authors: Zilu Guo, Hongbin Lin, Zhihao Yuan, Chaoda Zheng, Pengshuo Qiu, Dongzhi Jiang, Renrui Zhang, Chun-Mei Feng, Zhen Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10529
Pdf URL: https://arxiv.org/pdf/2503.10529
Copy Paste: [[2503.10529]] PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models(https://arxiv.org/abs/2503.10529)
Keywords: generative
Abstract: 3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.

Title: DP-GPL: Differentially Private Graph Prompt Learning

Authors: Jing Xu, Franziska Boenisch, Iyiola Emmanuel Olatunji, Adam Dziedzic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10544
Pdf URL: https://arxiv.org/pdf/2503.10544
Copy Paste: [[2503.10544]] DP-GPL: Differentially Private Graph Prompt Learning(https://arxiv.org/abs/2503.10544)
Keywords: foundation model
Abstract: Graph Neural Networks (GNNs) have shown remarkable performance in various applications. Recently, graph prompt learning has emerged as a powerful GNN training paradigm, inspired by advances in language and vision foundation models. Here, a GNN is pre-trained on public data and then adapted to sensitive tasks using lightweight graph prompts. However, using prompts from sensitive data poses privacy risks. In this work, we are the first to investigate these practical risks in graph prompts by instantiating a membership inference attack that reveals significant privacy leakage. We also find that the standard privacy method, DP-SGD, fails to provide practical privacy-utility trade-offs in graph prompt learning, likely due to the small number of sensitive data points used to learn the prompts. As a solution, we propose DP-GPL for differentially private graph prompt learning based on the PATE framework, which generates a graph prompt with differential privacy guarantees. Our evaluation across various graph prompt learning methods, GNN architectures, and pre-training strategies demonstrates that our algorithm achieves high utility at strong privacy, effectively mitigating privacy concerns while preserving the capabilities of prompted GNNs as powerful foundation models in the graph domain.

Title: MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup

Authors: Youngjin Kwon, Xiao Zhang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10549
Pdf URL: https://arxiv.org/pdf/2503.10549
Copy Paste: [[2503.10549]] MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup(https://arxiv.org/abs/2503.10549)
Keywords: diffusion, generative
Abstract: As facial recognition is increasingly adopted for government and commercial services, its potential misuse has raised serious concerns about privacy and civil rights. To counteract this, various anti-facial recognition techniques have been proposed for privacy protection by adversarially perturbing face images, among which generative makeup-based approaches are the most popular. However, these methods, designed primarily to impersonate specific target identities, can only achieve weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or lack the adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address the above limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity and stronger adaptability to various text makeup prompts.

Title: Long Context Tuning for Video Generation

Authors: Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10589
Pdf URL: https://arxiv.org/pdf/2503.10589
Copy Paste: [[2503.10589]] Long Context Tuning for Video Generation(https://arxiv.org/abs/2503.10589)
Keywords: diffusion
Abstract: Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See this https URL for more details.

Title: CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

Authors: Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10592
Pdf URL: https://arxiv.org/pdf/2503.10592
Copy Paste: [[2503.10592]] CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models(https://arxiv.org/abs/2503.10592)
Keywords: diffusion, generative
Abstract: This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and a limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve the dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.

Title: MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Authors: Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10604
Pdf URL: https://arxiv.org/pdf/2503.10604
Copy Paste: [[2503.10604]] MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction(https://arxiv.org/abs/2503.10604)
Keywords: diffusion
Abstract: Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under significant viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, providing comprehensive supervision signals to refine 3DGS representations and make rendering more robust under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.

Title: CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Authors: Advait Gupta, NandaKiran Velaga, Dang Nguyen, Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10613
Pdf URL: https://arxiv.org/pdf/2503.10613
Copy Paste: [[2503.10613]] CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing(https://arxiv.org/abs/2503.10613)
Keywords: diffusion
Abstract: Text-to-image models like Stable Diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks by AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimations of capabilities and costs of tools to determine which to apply in each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach "CoSTA*" that leverages LLMs to create a subtask tree, which helps prune a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance the total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), where a failure will trigger an update of the tool's cost and quality on the subtask. Hence, the A* search can recover from failures quickly to explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models or agents in terms of both cost and quality, and performs versatile trade-offs upon user preference.
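
A minimal sketch of searching a pruned tool graph with a combined cost-quality score is given below. It is illustrative only: the tool names, costs, quality values, the weighting `cost + lam * (1 - quality)`, and the zero heuristic are all assumptions, since the abstract does not say how CoSTA* combines the two metrics or which heuristic it uses. With a zero heuristic, A* is admissible but reduces to uniform-cost search.

```python
import heapq

def a_star(graph, step_cost, heuristic, start, goal):
    """Return (total_cost, path) for the cheapest start -> goal path, or None."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in graph.get(node, []):
            g2 = g + step_cost(nxt)
            if g2 < best.get(nxt, float("inf")):
                best[nxt] = g2
                heapq.heappush(frontier, (g2 + heuristic(nxt), g2, nxt, path + [nxt]))
    return None

# Hypothetical pruned tool graph: two candidate tools for each of two subtasks.
graph = {
    "input": ["detect_fast", "detect_accurate"],
    "detect_fast": ["inpaint_small", "inpaint_large"],
    "detect_accurate": ["inpaint_small", "inpaint_large"],
    "inpaint_small": ["done"],
    "inpaint_large": ["done"],
}
# Hypothetical (cost in seconds, quality in [0, 1]) per tool.
tools = {
    "detect_fast": (1.0, 0.70),
    "detect_accurate": (4.0, 0.95),
    "inpaint_small": (2.0, 0.60),
    "inpaint_large": (6.0, 0.90),
}

def plan(lam):
    """Find a tool path; lam weights quality loss against execution cost."""
    def step_cost(tool):
        cost, quality = tools.get(tool, (0.0, 1.0))  # "done" is free
        return cost + lam * (1.0 - quality)
    return a_star(graph, step_cost, lambda node: 0.0, "input", "done")

cheap_path = plan(lam=0.0)[1]     # cost-only preference
quality_path = plan(lam=20.0)[1]  # quality-heavy preference
print(cheap_path)
print(quality_path)
```

Raising `lam` shifts the chosen path from the cheaper tools toward the higher-quality ones. The failure-triggered update described in the abstract would correspond to editing an entry in `tools` and re-running the search.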

Title: ConsisLoRA: Enhancing Content and Style Consistency for LoRA-based Style Transfer

Authors: Bolin Chen, Baoquan Zhao, Haoran Xie, Yi Cai, Qing Li, Xudong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10614
Pdf URL: https://arxiv.org/pdf/2503.10614
Copy Paste: [[2503.10614]] ConsisLoRA: Enhancing Content and Style Consistency for LoRA-based Style Transfer(https://arxiv.org/abs/2503.10614)
Keywords: diffusion
Abstract: Style transfer involves transferring the style from a reference image to the content of a target image. Recent advancements in LoRA-based (Low-Rank Adaptation) methods have shown promise in effectively capturing the style of a single image. However, these approaches still face significant challenges such as content inconsistency, style misalignment, and content leakage. In this paper, we comprehensively analyze the limitations of the standard diffusion parameterization, which learns to predict noise, in the context of style transfer. To address these issues, we introduce ConsisLoRA, a LoRA-based method that enhances both content and style consistency by optimizing the LoRA weights to predict the original image rather than noise. We also propose a two-step training strategy that decouples the learning of content and style from the reference image. To effectively capture both the global structure and local details of the content image, we introduce a stepwise loss transition strategy. Additionally, we present an inference guidance method that enables continuous control over content and style strengths during inference. Through both qualitative and quantitative evaluations, our method demonstrates significant improvements in content and style consistency while effectively reducing content leakage.

Title: DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Authors: Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10618
Pdf URL: https://arxiv.org/pdf/2503.10618
Copy Paste: [[2503.10618]] DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation(https://arxiv.org/abs/2503.10618)
Keywords: diffusion
Abstract: In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures -- including PixArt-style and MMDiT variants -- and compare them with a standard DiT variant which directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of standard DiT is comparable to that of those specialized models, while demonstrating superior parameter-efficiency, especially when scaled up. Leveraging the layer-wise parameter sharing strategy, we achieve a further reduction of 66% in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.

Title: Transformers without Normalization

Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10622
Pdf URL: https://arxiv.org/pdf/2503.10622
Copy Paste: [[2503.10622]] Transformers without Normalization(https://arxiv.org/abs/2503.10622)
Keywords: self-supervised
Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
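
The element-wise operation named in the abstract is small enough to sketch directly. The following is a plain-Python illustration of tanh(alpha * x) only. The function name `dyt` and the default `alpha=0.5` are assumptions made for the example, and in a real model alpha would be a learnable parameter.

```python
import math

def dyt(xs, alpha=0.5):
    """Apply tanh(alpha * x) element-wise to a list of floats.

    alpha is a scalar; the default here is a hypothetical value for
    illustration, not a setting confirmed by the abstract.
    """
    return [math.tanh(alpha * x) for x in xs]

# Extreme activations are squashed toward -1 and 1, while values near zero
# pass through almost linearly (scaled by alpha), giving an S-shaped mapping.
activations = [-100.0, -1.0, 0.0, 1.0, 100.0]
squashed = dyt(activations, alpha=0.5)
print(squashed)
```

Unlike layer normalization, this mapping needs no per-token mean or variance statistics; each element is transformed independently.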

Title: NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

Authors: Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10626
Pdf URL: https://arxiv.org/pdf/2503.10626
Copy Paste: [[2503.10626]] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models(https://arxiv.org/abs/2503.10626)
Keywords: diffusion, generative
Abstract: Acquiring physically plausible motor skills across diverse and unconventional morphologies (including humanoid robots, quadrupeds, and animals) is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process using vision transformers for video-based comparisons, calculating the pairwise distance between video embeddings. Along with the video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.

Title: Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology

Authors: Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10629
Pdf URL: https://arxiv.org/pdf/2503.10629
Copy Paste: [[2503.10629]] Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology(https://arxiv.org/abs/2503.10629)
Keywords: self-supervised
Abstract: Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrates them into adversarial training for enhanced robustness. We evaluate HSAT on the multiclass histopathology dataset OpenSRH, and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our code for training and evaluation is available at this https URL.

Title: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Authors: Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10631
Pdf URL: https://arxiv.org/pdf/2503.10631
Copy Paste: [[2503.10631]] HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model(https://arxiv.org/abs/2503.10631)
Keywords: diffusion
Abstract: Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.

Title: V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes

Authors: Yanming Zhang, Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10634
Pdf URL: https://arxiv.org/pdf/2503.10634
Copy Paste: [[2503.10634]] V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes(https://arxiv.org/abs/2503.10634)
Keywords: diffusion
Abstract: This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.

Title: Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

Authors: Xiaoming Zhao, Alexander G. Schwing
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10638
Pdf URL: https://arxiv.org/pdf/2503.10638
Copy Paste: [[2503.10638]] Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective(https://arxiv.org/abs/2503.10638)
Keywords: diffusion
Abstract: Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution of a pre-trained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.
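
Note: For readers unfamiliar with classifier-free guidance, the sketch below shows one commonly used way to combine conditional and unconditional noise predictions at a single denoising step. It is background only and is not taken from this abstract: the function name `cfg_combine`, the `guidance_scale` parameter, and the toy numbers are assumptions for illustration, and the flow-matching postprocessing step the paper proposes is not shown.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    Both inputs are flat lists standing in for a model's noise predictions
    at a single denoising step (hypothetical values, not real model output).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy predictions for a two-dimensional sample.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

# Scale 0 ignores the condition; scale 1 reproduces the conditional prediction.
no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)
plain_conditional = cfg_combine(eps_uncond, eps_cond, 1.0)

# Scales above 1 push the result past the conditional prediction,
# further away from the unconditional one.
strong_guidance = cfg_combine(eps_uncond, eps_cond, 2.0)
print(no_guidance, plain_conditional, strong_guidance)
```

In this form, a larger scale amplifies whatever direction separates the conditional prediction from the unconditional one.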

Title: GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10639
Pdf URL: https://arxiv.org/pdf/2503.10639
Copy Paste: [[2503.10639]] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing(https://arxiv.org/abs/2503.10639)
Keywords: diffusion
Abstract: Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.