2025-02-13

Title: CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Authors: Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.07811
Pdf URL: https://arxiv.org/pdf/2502.07811
Copy Paste: [[2502.07811]] CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders(https://arxiv.org/abs/2502.07811)
Keywords: self-supervised
Abstract: Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective, which may lead the model to prioritize general spatial-temporal patterns but often overlook nuanced semantic attributes like specific interactions or sequences that define actions - such as action-specific features that align more closely with human cognition for space-time correspondence. This can limit the model's ability to capture the essence of certain actions that are contextually rich and continuous. Humans are capable of mapping visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes or videos. Existing MAEs for videos and static images rely on separate datasets for videos and images, which may lack the rich semantic attributes necessary for fully understanding the learned concepts, especially when compared to using video and corresponding sampled frame images together. To this end, we propose CrossVideoMAE, an end-to-end self-supervised cross-modal contrastive learning MAE that effectively learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames within a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved through jointly embedding features of visible tokens and combining feature correspondence within and across modalities, which is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods, and ablation studies validate the effectiveness of our approach.

Title: Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution

Authors: Siwei Tu, Ben Fei, Weidong Yang, Fenghua Ling, Hao Chen, Zili Liu, Kun Chen, Hang Fan, Wanli Ouyang, Lei Bai
Subjects: cs.LG, cs.AI, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2502.07814
Pdf URL: https://arxiv.org/pdf/2502.07814
Copy Paste: [[2502.07814]] Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution(https://arxiv.org/abs/2502.07814)
Keywords: diffusion
Abstract: Accurate acquisition of surface meteorological conditions at arbitrary locations holds significant importance for weather forecasting and climate simulation. Because meteorological states derived from satellite observations are often provided in the form of low-resolution grid fields, the direct application of spatial interpolation to obtain meteorological states for specific locations often results in significant discrepancies when compared to actual observations. Existing downscaling methods for acquiring meteorological state information at higher resolutions commonly overlook the correlation with satellite observations. To bridge the gap, we propose Satellite-observations Guided Diffusion Model (SGD), a conditional diffusion model pre-trained on ERA5 reanalysis data with satellite observations (GridSat) as conditions, which is employed for sampling downscaled meteorological states through a zero-shot guided sampling strategy and patch-based methods. During the training process, we propose to fuse the information from GridSat satellite observations into ERA5 maps via the attention mechanism, enabling SGD to generate atmospheric states that align more accurately with actual conditions. During sampling, we employ optimizable convolutional kernels to simulate the upscale process, thereby generating high-resolution ERA5 maps using low-resolution ERA5 maps as well as observations from weather stations as guidance. Moreover, our devised patch-based method enables SGD to generate meteorological states at arbitrary resolutions. Experiments demonstrate that SGD achieves accurate downscaling of meteorological states to 6.25 km.

Title: Pre-Trained Video Generative Models as World Simulators

Authors: Haoran He, Yang Zhang, Liang Lin, Zhongwen Xu, Ling Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07825
Pdf URL: https://arxiv.org/pdf/2502.07825
Copy Paste: [[2502.07825]] Pre-Trained Video Generative Models as World Simulators(https://arxiv.org/abs/2502.07825)
Keywords: diffusion, generative
Abstract: Video generative models pre-trained on large-scale internet datasets have achieved remarkable success, excelling at producing realistic synthetic videos. However, they often generate clips based on static prompts (e.g., text or images), limiting their ability to model interactive and dynamic scenarios. In this paper, we propose Dynamic World Simulation (DWS), a novel approach to transform pre-trained video generative models into controllable world simulators capable of executing specified action trajectories. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module that seamlessly integrates into any existing model. Instead of focusing on complex visual details, we demonstrate that consistent dynamic transition modeling is the key to building powerful world simulators. Building upon this insight, we further introduce a motion-reinforced loss that enhances action controllability by compelling the model to capture dynamic changes more effectively. Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models, achieving significant improvements in generating action-controllable, dynamically consistent videos across games and robotics domains. Moreover, to facilitate the applications of the learned world simulator in downstream tasks such as model-based reinforcement learning, we propose prioritized imagination to improve sample efficiency, demonstrating competitive performance compared with state-of-the-art methods.

Title: Preference Alignment on Diffusion Model: A Comprehensive Survey for Image Generation and Editing

Authors: Sihao Wu, Xiaonan Si, Chi Xing, Jianhong Wang, Gaojie Jin, Guangliang Cheng, Lijun Zhang, Xiaowei Huang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07829
Pdf URL: https://arxiv.org/pdf/2502.07829
Copy Paste: [[2502.07829]] Preference Alignment on Diffusion Model: A Comprehensive Survey for Image Generation and Editing(https://arxiv.org/abs/2502.07829)
Keywords: diffusion
Abstract: The integration of preference alignment with diffusion models (DMs) has emerged as a transformative approach to enhance image generation and editing capabilities. Although integrating diffusion models with preference alignment strategies poses significant challenges for novices at this intersection, comprehensive and systematic reviews of this subject are still notably lacking. To bridge this gap, this paper extensively surveys preference alignment with diffusion models in image generation and editing. First, we systematically review cutting-edge optimization techniques such as reinforcement learning with human feedback (RLHF), direct preference optimization (DPO), and others, highlighting their pivotal role in aligning preferences with DMs. Then, we thoroughly explore the applications of aligning preferences with DMs in autonomous driving, medical imaging, robotics, and more. Finally, we comprehensively discuss the challenges of preference alignment with DMs. To our knowledge, this is the first survey centered on preference alignment with DMs, providing insights to drive future innovation in this dynamic area.

Title: Captured by Captions: On Memorization and its Mitigation in CLIP Models

Authors: Wenhao Wang, Adam Dziedzic, Grace C. Kim, Michael Backes, Franziska Boenisch
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07830
Pdf URL: https://arxiv.org/pdf/2502.07830
Copy Paste: [[2502.07830]] Captured by Captions: On Memorization and its Mitigation in CLIP Models(https://arxiv.org/abs/2502.07830)
Keywords: self-supervised
Abstract: Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP's memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility--something that had not been shown before for traditional learning paradigms where reducing memorization typically results in utility decrease.

Title: TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation

Authors: Jeongyun Kim, Jeongho Noh, Dong-Guw Lee, Ayoung Kim
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2502.07840
Pdf URL: https://arxiv.org/pdf/2502.07840
Copy Paste: [[2502.07840]] TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation(https://arxiv.org/abs/2502.07840)
Keywords: diffusion
Abstract: Transparent object manipulation remains a significant challenge in robotics due to the difficulty of acquiring accurate and dense depth measurements. Conventional depth sensors often fail with transparent objects, resulting in incomplete or erroneous depth data. Existing depth completion methods struggle with interframe consistency and incorrectly model transparent objects as Lambertian surfaces, leading to poor depth reconstruction. To address these challenges, we propose TranSplat, a surface embedding-guided 3D Gaussian Splatting method tailored for transparent objects. TranSplat uses a latent diffusion model to generate surface embeddings that provide consistent and continuous representations, making it robust to changes in viewpoint and lighting. By integrating these surface embeddings with input RGB images, TranSplat effectively captures the complexities of transparent surfaces, enhancing the splatting of 3D Gaussians and improving depth completion. Evaluations on synthetic and real-world transparent object benchmarks, as well as robot grasping tasks, show that TranSplat achieves accurate and dense depth completion, demonstrating its effectiveness in practical applications. We open-source our synthetic dataset and model: https://github.com/jeongyun0609/TranSplat

Title: Spread them Apart: Towards Robust Watermarking of Generated Content

Authors: Mikhail Pautov, Danil Ivanov, Andrey V. Galichin, Oleg Rogov, Ivan Oseledets
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07845
Pdf URL: https://arxiv.org/pdf/2502.07845
Copy Paste: [[2502.07845]] Spread them Apart: Towards Robust Watermarking of Generated Content(https://arxiv.org/abs/2502.07845)
Keywords: diffusion, generative
Abstract: Generative models that can produce realistic images have improved significantly in recent years. The quality of the generated content has increased drastically, so sometimes it is very difficult to distinguish between the real images and the generated ones. Such an improvement comes at a price of ethical concerns about the usage of the generative models: the users of generative models can improperly claim ownership of the generated content protected by a license. In this paper, we propose an approach to embed watermarks into the generated content to allow future detection of the generated content and identification of the user who generated it. The watermark is embedded during the inference of the model, so the proposed approach does not require the retraining of the latter. We prove that watermarks embedded are guaranteed to be robust against additive perturbations of a bounded magnitude. We apply our method to watermark diffusion models and show that it matches state-of-the-art watermarking schemes in terms of robustness to different types of synthetic watermark removal attacks.

Title: Technical note on calibrating vision-language models under covariate shift

Authors: Behraj Khan, Rizwan Qureshi, Tahir Syed
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07847
Pdf URL: https://arxiv.org/pdf/2502.07847
Copy Paste: [[2502.07847]] Technical note on calibrating vision-language models under covariate shift(https://arxiv.org/abs/2502.07847)
Keywords: foundation model
Abstract: Despite being a successful example of emerging capability, vision-language foundation models for low-shot vision classification have a limited ability to sufficiently generalize to the target data distribution due to sample poverty, leading to sensitivity to variations in the data. A popular mitigation strategy is finetuning over multiple datasets, but domain generalization is expensive when practiced in this manner. This work examines both covariate shift between pre-training data and the underspecified target data, and confidence misalignment, where the model's prediction confidence is amplified by the limited data availability. We propose Confidence-Calibrated Covariate Shift Correction (C3SC), a unified framework to mitigate both covariate shift and confidence misalignment. C3SC leverages a Fisher information penalty for covariate shift correction and a confidence misalignment penalty (CMP) to lower confidence on misclassified examples. Experimental results across various vision and covariate shift datasets demonstrate that C3SC improves calibration (ECE) by up to 5.82%. C3SC also shows better robustness, with a 3.5% improvement in accuracy on challenging covariate shift datasets, making C3SC a promising solution for reliable real-world vision-language low-shot applications under distribution shift.
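
Note: The abstract above reports calibration in terms of ECE (expected calibration error). As a rough illustration of what that metric measures, here is a minimal sketch of the standard equal-width binned ECE. It is not taken from the paper: the function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    n_bins:      number of equal-width confidence bins (illustrative default)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Toy example: the model claims 90% confidence but is right 3 times out of 4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False])
```

A lower value means predicted confidence tracks empirical accuracy more closely; the gap in the toy example comes entirely from overconfidence in a single bin.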

Title: Understanding Classifier-Free Guidance: High-Dimensional Theory and Non-Linear Generalizations

Authors: Krunoslav Lehman Pavasovic, Jakob Verbeek, Giulio Biroli, Marc Mezard
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.07849
Pdf URL: https://arxiv.org/pdf/2502.07849
Copy Paste: [[2502.07849]] Understanding Classifier-Free Guidance: High-Dimensional Theory and Non-Linear Generalizations(https://arxiv.org/abs/2502.07849)
Keywords: diffusion
Abstract: Recent studies have raised concerns about the effectiveness of Classifier-Free Guidance (CFG), indicating that in low-dimensional settings, it can lead to overshooting the target distribution and reducing sample diversity. In this work, we demonstrate that in infinite and sufficiently high-dimensional contexts CFG effectively reproduces the target distribution, revealing a blessing-of-dimensionality result. Additionally, we explore finite-dimensional effects, precisely characterizing overshoot and variance reduction. Based on our analysis, we introduce non-linear generalizations of CFG. Through numerical simulations on Gaussian mixtures and experiments on class-conditional and text-to-image diffusion models, we validate our analysis and show that our non-linear CFG offers improved flexibility and generation quality without additional computation cost.
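
Note: For readers unfamiliar with Classifier-Free Guidance, the sketch below shows the commonly used linear combination of conditional and unconditional model outputs that CFG is usually understood to mean. It does not implement the non-linear generalizations proposed in the paper. The function name and the scale convention (scale 1 recovers the conditional prediction; some papers write the weight as 1 + w instead) are assumptions made for illustration.

```python
def cfg_combine(pred_uncond, pred_cond, scale):
    """Linear classifier-free guidance on per-element model outputs.

    pred_uncond: model output without conditioning (list of floats)
    pred_cond:   model output with conditioning (same length)
    scale:       guidance scale; 0 gives the unconditional output,
                 1 gives the conditional output, >1 extrapolates past it
    """
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Toy vectors standing in for noise (or score) predictions at one step.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0)
```

With a scale above 1 the guided output moves beyond the conditional prediction, away from the unconditional one, which is the extrapolation behind the overshoot and reduced diversity discussed in the abstract.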

Title: MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers

Authors: Ao Li, Wei Fang, Hongbo Zhao, Le Lu, Ge Yang, Minfeng Xu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07856
Pdf URL: https://arxiv.org/pdf/2502.07856
Copy Paste: [[2502.07856]] MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers(https://arxiv.org/abs/2502.07856)
Keywords: diffusion
Abstract: In applications of diffusion models, controllable generation is of practical significance, but is also challenging. Current methods for controllable generation primarily focus on modifying the score function of diffusion models, while Mean Reverting (MR) Diffusion directly modifies the structure of the stochastic differential equation (SDE), making the incorporation of image conditions simpler and more natural. However, current training-free fast samplers are not directly applicable to MR Diffusion, so MR Diffusion requires hundreds of NFEs (number of function evaluations) to obtain high-quality samples. In this paper, we propose a new algorithm named MRS (MR Sampler) to reduce the sampling NFEs of MR Diffusion. We solve the reverse-time SDE and the probability flow ordinary differential equation (PF-ODE) associated with MR Diffusion, and derive semi-analytical solutions. The solutions consist of an analytical function and an integral parameterized by a neural network. Based on these solutions, we can generate high-quality samples in fewer steps. Our approach does not require training and supports all mainstream parameterizations, including noise prediction, data prediction and velocity prediction. Extensive experiments demonstrate that MR Sampler maintains high sampling quality with a speedup of 10 to 20 times across ten different image restoration tasks. Our algorithm accelerates the sampling procedure of MR Diffusion, making it more practical in controllable generation.

Title: MAAT: Mamba Adaptive Anomaly Transformer with association discrepancy for time series

Authors: Abdellah Zakaria Sellam, Ilyes Benaissa, Abdelmalik Taleb-Ahmed, Luigi Patrono, Cosimo Distante
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.07858
Pdf URL: https://arxiv.org/pdf/2502.07858
Copy Paste: [[2502.07858]] MAAT: Mamba Adaptive Anomaly Transformer with association discrepancy for time series(https://arxiv.org/abs/2502.07858)
Keywords: anomaly
Abstract: Anomaly detection in time series is essential for industrial monitoring and environmental sensing, yet distinguishing anomalies from complex patterns remains challenging. Existing methods like the Anomaly Transformer and DCdetector have progressed, but they face limitations such as sensitivity to short-term contexts and inefficiency in noisy, non-stationary environments. To overcome these issues, we introduce MAAT, an improved architecture that enhances association discrepancy modeling and reconstruction quality. MAAT features Sparse Attention, efficiently capturing long-range dependencies by focusing on relevant time steps, thereby reducing computational redundancy. Additionally, a Mamba-Selective State Space Model is incorporated into the reconstruction module, utilizing a skip connection and Gated Attention to improve anomaly localization and detection performance. Extensive experiments show that MAAT significantly outperforms previous methods, achieving better anomaly distinguishability and generalization across various time series applications, setting a new standard for unsupervised time series anomaly detection in real-world scenarios.

Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Authors: Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.07870
Pdf URL: https://arxiv.org/pdf/2502.07870
Copy Paste: [[2502.07870]] TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation(https://arxiv.org/abs/2502.07870)
Keywords: generative
Abstract: Text-conditioned image generation has gained significant attention in recent years, and models are processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a human-improved test set of 3000 examples across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.

Title: Elevating Legal LLM Responses: Harnessing Trainable Logical Structures and Semantic Knowledge with Legal Reasoning

Authors: Rujing Yao, Yang Wu, Chenghao Wang, Jingwei Xiong, Fang Wang, Xiaozhong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07912
Pdf URL: https://arxiv.org/pdf/2502.07912
Copy Paste: [[2502.07912]] Elevating Legal LLM Responses: Harnessing Trainable Logical Structures and Semantic Knowledge with Legal Reasoning(https://arxiv.org/abs/2502.07912)
Keywords: in-context
Abstract: Large Language Models (LLMs) have achieved impressive results across numerous domains, yet they experience notable deficiencies in legal question-answering tasks. LLMs often generate generalized responses that lack the logical specificity required for expert legal advice and are prone to hallucination, providing answers that appear correct but are unreliable. Retrieval-Augmented Generation (RAG) techniques offer partial solutions to address this challenge, but existing approaches typically focus only on semantic similarity, neglecting the logical structure essential to legal reasoning. In this paper, we propose the Logical-Semantic Integration Model (LSIM), a novel supervised framework that bridges semantic and logical coherence. LSIM comprises three components: reinforcement learning predicts a structured fact-rule chain for each question, a trainable Deep Structured Semantic Model (DSSM) retrieves the most relevant candidate questions by integrating semantic and logical features, and in-context learning generates the final answer using the retrieved content. Our experiments on a real-world legal QA dataset, validated through both automated metrics and human evaluation, demonstrate that LSIM significantly enhances accuracy and reliability compared to existing methods.

Title: SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion

Authors: Yannik Frisch, Ssharvien Kumar Sivakumar, Çağhan Köksal, Elsa Böhm, Felix Wagner, Adrian Gericke, Ghazal Ghazaei, Anirban Mukhopadhyay
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07945
Pdf URL: https://arxiv.org/pdf/2502.07945
Copy Paste: [[2502.07945]] SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion(https://arxiv.org/abs/2502.07945)
Keywords: diffusion
Abstract: Surgical simulation offers a promising addition to conventional surgical training. However, available simulation tools lack photorealism and rely on hardcoded behaviour. Denoising Diffusion Models are a promising alternative for high-fidelity image synthesis, but existing state-of-the-art conditioning methods fall short in providing precise control or interactivity over the generated scenes. We introduce SurGrID, a Scene Graph to Image Diffusion Model, allowing for controllable surgical scene synthesis by leveraging Scene Graphs. These graphs encode a surgical scene's components' spatial and semantic information, which are then translated into an intermediate representation using our novel pre-training step that explicitly captures local and global information. Our proposed method improves the fidelity of generated images and their coherence with the graph input over the state-of-the-art. Further, we demonstrate the simulation's realism and controllability in a user assessment study involving clinical experts. Scene Graphs can be effectively used for precise and interactive conditioning of Denoising Diffusion Models for simulating surgical scenes, enabling high fidelity and interactive control over the generated content.

Title: Federated Self-supervised Domain Generalization for Label-efficient Polyp Segmentation

Authors: Xinyi Tan, Jiacheng Wang, Liansheng Wang
Subjects: cs.CV, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07951
Pdf URL: https://arxiv.org/pdf/2502.07951
Copy Paste: [[2502.07951]] Federated Self-supervised Domain Generalization for Label-efficient Polyp Segmentation(https://arxiv.org/abs/2502.07951)
Keywords: self-supervised
Abstract: Employing self-supervised learning (SSL) methodologies assumes paramount significance in handling unlabeled polyp datasets when building deep learning-based automatic polyp segmentation models. However, the intricate privacy dynamics surrounding medical data often preclude seamless data sharing among disparate medical centers. Federated learning (FL) emerges as a formidable solution to this privacy conundrum, yet within the realm of FL, optimizing model generalization stands as a pressing imperative. Robust generalization capabilities are imperative to ensure the model's efficacy across diverse geographical domains post-training on localized client datasets. In this paper, a Federated self-supervised Domain Generalization method is proposed to enhance the generalization capacity of federated and Label-efficient intestinal polyp segmentation, named LFDG. Based on a classical SSL method, DropPos, LFDG proposes an adversarial learning-based data augmentation method (SSADA) to enhance the data diversity. LFDG further proposes a relaxation module based on Source-reconstruction and Augmentation-masking (SRAM) to maintain stability in feature learning. We have validated LFDG on polyp images from six medical centers. Our method performs 3.80% and 3.92% better than the baseline and other recent FL and SSL methods, respectively.

Title: Generative Risk Minimization for Out-of-Distribution Generalization on Graphs

Authors: Song Wang, Zhen Tan, Yaochen Zhu, Chuxu Zhang, Jundong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07968
Pdf URL: https://arxiv.org/pdf/2502.07968
Copy Paste: [[2502.07968]] Generative Risk Minimization for Out-of-Distribution Generalization on Graphs(https://arxiv.org/abs/2502.07968)
Keywords: generative
Abstract: Out-of-distribution (OOD) generalization on graphs aims at dealing with scenarios where the test graph distribution differs from the training graph distributions. Compared to i.i.d. data like images, the OOD generalization problem on graph-structured data remains challenging due to the non-i.i.d. property and complex structural information on graphs. Recently, several works on graph OOD generalization have explored extracting invariant subgraphs that share crucial classification information across different distributions. Nevertheless, such a strategy could be suboptimal for entirely capturing the invariant information, as the extraction of discrete structures could potentially lead to the loss of invariant information or the involvement of spurious information. In this paper, we propose an innovative framework, named Generative Risk Minimization (GRM), designed to generate an invariant subgraph for each input graph to be classified, instead of extraction. To address the challenge of optimization in the absence of optimal invariant subgraphs (i.e., ground truths), we derive a tractable form of the proposed GRM objective by introducing a latent causal variable, and its effectiveness is validated by our theoretical analysis. We further conduct extensive experiments across a variety of real-world graph datasets for both node-level and graph-level OOD generalization, and the results demonstrate the superiority of our framework GRM.

Title: A Survey of In-Context Reinforcement Learning

Authors: Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, Shangtong Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.07978
Pdf URL: https://arxiv.org/pdf/2502.07978
Copy Paste: [[2502.07978]] A Survey of In-Context Reinforcement Learning(https://arxiv.org/abs/2502.07978)
Keywords: in-context
Abstract: Reinforcement learning (RL) agents typically optimize their policies by performing expensive backward passes to update their network parameters. However, some agents can solve new tasks without updating any parameters by simply conditioning on additional context such as their action-observation histories. This paper surveys work on such behavior, known as in-context reinforcement learning.

Title: Towards Training One-Step Diffusion Models Without Distillation

Authors: Mingtian Zhang, Jiajun He, Wenlin Chen, Zijing Ou, José Miguel Hernández-Lobato, Bernhard Schölkopf, David Barber
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.08005
Pdf URL: https://arxiv.org/pdf/2502.08005
Copy Paste: [[2502.08005]] Towards Training One-Step Diffusion Models Without Distillation(https://arxiv.org/abs/2502.08005)
Keywords: diffusion, generative
Abstract: Recent advances in one-step generative models typically follow a two-stage process: first training a teacher diffusion model and then distilling it into a one-step student model. This distillation process traditionally relies on both the teacher model's score function to compute the distillation loss and its weights for student initialization. In this paper, we explore whether one-step generative models can be trained directly without this distillation process. First, we show that the teacher's score function is not essential and propose a family of distillation methods that achieve competitive results without relying on score estimation. Next, we demonstrate that initialization from teacher weights is indispensable in successful training. Surprisingly, we find that this benefit is not due to improved "input-output" mapping but rather the learned feature representations, which dominate distillation quality. Our findings provide a better understanding of the role of initialization in one-step model training and its impact on distillation quality.

Title: Greed is Good: Guided Generation from a Greedy Perspective

Authors: Zander W. Blasingame, Chen Liu
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.08006
Pdf URL: https://arxiv.org/pdf/2502.08006
Copy Paste: [[2502.08006]] Greed is Good: Guided Generation from a Greedy Perspective(https://arxiv.org/abs/2502.08006)
Keywords: diffusion, generative
Abstract: Training-free guided generation is a widely used and powerful technique that allows the end user to exert further control over the generative process of diffusion models. In this work, we explore the guided generation from the perspective of optimizing the solution trajectory of a neural differential equation in a greedy manner. We present such a strategy as a unifying view on training-free guidance by showing that the greedy strategy is a first-order discretization of end-to-end optimization techniques. We show that a greedy guidance strategy makes good decisions and compare it to a guidance strategy using the ideal gradients found via the continuous adjoint equations. We then show how other popular training-free guidance strategies can be viewed in a unified manner from this perspective.

Title: The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models

Authors: Artem Kirsanov, Chi-Ning Chou, Kyunghyun Cho, SueYeon Chung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.08009
Pdf URL: https://arxiv.org/pdf/2502.08009
Copy Paste: [[2502.08009]] The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models(https://arxiv.org/abs/2502.08009)
Keywords: in-context
Abstract: Decoder-only language models have the ability to dynamically switch between various computational tasks based on input prompts. Despite many successful applications of prompting, there is very limited understanding of the internal mechanism behind such flexibility. In this work, we investigate how different prompting methods affect the geometry of representations in these models. Employing a framework grounded in statistical physics, we reveal that various prompting techniques, while achieving similar performance, operate through distinct representational mechanisms for task adaptation. Our analysis highlights the critical role of input distribution samples and label semantics in few-shot in-context learning. We also demonstrate evidence of synergistic and interfering interactions between different tasks on the representational level. Our work contributes to the theoretical understanding of large language models and lays the groundwork for developing more effective, representation-aware prompting strategies.

Title: Franken-Adapter: Cross-Lingual Adaptation of LLMs by Embedding Surgery

Authors: Fan Jiang, Honglin Yu, Grace Chung, Trevor Cohn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.08037
Pdf URL: https://arxiv.org/pdf/2502.08037
Copy Paste: [[2502.08037]] Franken-Adapter: Cross-Lingual Adaptation of LLMs by Embedding Surgery(https://arxiv.org/abs/2502.08037)
Keywords: generative
Abstract: The capabilities of Large Language Models (LLMs) in low-resource languages lag far behind those in English, making their universal accessibility a significant challenge. To alleviate this, we present Franken-Adapter, a modular language adaptation approach for decoder-only LLMs with embedding surgery. Our method begins by creating customized vocabularies for target languages and performing language adaptation through embedding tuning on multilingual data. These pre-trained embeddings are subsequently integrated with LLMs that have been instruction-tuned on English alignment data to enable zero-shot cross-lingual transfer. Our experiments on Gemma2 models with up to 27B parameters demonstrate improvements of up to 20% across 96 languages, spanning both discriminative and generative tasks, with minimal regressions (<1%) in English. Further in-depth analysis reveals the critical role of customizing tokenizers in enhancing language adaptation, while boosting inference efficiency. Additionally, we show the versatility of our method by achieving a 14% improvement over a math-optimized LLM across 20 languages, offering a modular solution to transfer reasoning abilities across languages post hoc.

Title: Out-of-Distribution Detection on Graphs: A Survey

Authors: Tingyi Cai, Yunliang Jiang, Yixin Liu, Ming Li, Changqin Huang, Shirui Pan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.08105
Pdf URL: https://arxiv.org/pdf/2502.08105
Copy Paste: [[2502.08105]] Out-of-Distribution Detection on Graphs: A Survey(https://arxiv.org/abs/2502.08105)
Keywords: anomaly
Abstract: Graph machine learning has witnessed rapid growth, driving advancements across diverse domains. However, the in-distribution assumption, where training and testing data share the same distribution, often breaks in real-world scenarios, leading to degraded model performance under distribution shifts. This challenge has catalyzed interest in graph out-of-distribution (GOOD) detection, which focuses on identifying graph data that deviates from the distribution seen during training, thereby enhancing model robustness. In this paper, we provide a rigorous definition of GOOD detection and systematically categorize existing methods into four types: enhancement-based, reconstruction-based, information propagation-based, and classification-based approaches. We analyze the principles and mechanisms of each approach and clarify the distinctions between GOOD detection and related fields, such as graph anomaly detection, outlier detection, and GOOD generalization. Beyond methodology, we discuss practical applications and theoretical foundations, highlighting the unique challenges posed by graph data. Finally, we discuss the primary challenges and propose future directions to advance this emerging field. The repository of this survey is available at this https URL.

Title: PoGDiff: Product-of-Gaussians Diffusion Models for Imbalanced Text-to-Image Generation

Authors: Ziyan Wang, Sizhe Wei, Xiaoming Huo, Hao Wang
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2502.08106
Pdf URL: https://arxiv.org/pdf/2502.08106
Copy Paste: [[2502.08106]] PoGDiff: Product-of-Gaussians Diffusion Models for Imbalanced Text-to-Image Generation(https://arxiv.org/abs/2502.08106)
Keywords: diffusion
Abstract: Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
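
The Product of Gaussians named in the abstract rests on a standard identity: the product of two Gaussian densities is proportional to a Gaussian whose precision is the sum of the two precisions. The sketch below shows only that identity for the scalar case; the function name and the example values are hypothetical, and how PoGDiff actually weights or constructs the two factors is not stated in the abstract.

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """Return (mean, variance) of the Gaussian proportional to N(mu1, var1) * N(mu2, var2)."""
    if var1 <= 0 or var2 <= 0:
        raise ValueError("variances must be positive")
    # Precisions (inverse variances) add under multiplication of densities.
    precision = 1.0 / var1 + 1.0 / var2
    var = 1.0 / precision
    # The mean is the precision-weighted average of the two means.
    mean = var * (mu1 / var1 + mu2 / var2)
    return mean, var


# Hypothetical values: one factor stands in for the original target,
# the other for a prediction conditioned on a neighboring text embedding.
mean, var = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
```

With equal variances the combined mean sits halfway between the two means, and the combined variance is smaller than either input variance.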

Title: In-Context Learning of Linear Dynamical Systems with Transformers: Error Bounds and Depth-Separation

Authors: Frank Cole, Yulong Lu, Tianhao Zhang, Yuxuan Zhao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.08136
Pdf URL: https://arxiv.org/pdf/2502.08136
Copy Paste: [[2502.08136]] In-Context Learning of Linear Dynamical Systems with Transformers: Error Bounds and Depth-Separation(https://arxiv.org/abs/2502.08136)
Keywords: in-context
Abstract: This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable with those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
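
For reference, the least-squares estimator that the abstract uses as a yardstick can be written down for the simplest case. The sketch below assumes a scalar system x_{t+1} = a * x_t + noise, which is a simplification chosen for illustration; the paper's setting and notation may differ, and the function name is hypothetical.

```python
def least_squares_scalar(trajectory):
    """Estimate a in x_{t+1} = a * x_t + noise from a single observed trajectory."""
    # Pair each state with its successor.
    pairs = list(zip(trajectory[:-1], trajectory[1:]))
    denom = sum(x * x for x, _ in pairs)
    if denom == 0:
        raise ValueError("trajectory carries no information about a")
    # Minimizer of sum_t (x_{t+1} - a * x_t)^2 over a.
    return sum(x * y for x, y in pairs) / denom


# Noise-free trajectory generated with a = 0.5 (hypothetical data).
a_hat = least_squares_scalar([1.0, 0.5, 0.25, 0.125])
```

Consecutive states in one trajectory depend on each other, which is the non-IID setting the abstract contrasts with IID data.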

Title: Force Matching with Relativistic Constraints: A Physics-Inspired Approach to Stable and Efficient Generative Modeling

Authors: Yang Cao, Bo Chen, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.08150
Pdf URL: https://arxiv.org/pdf/2502.08150
Copy Paste: [[2502.08150]] Force Matching with Relativistic Constraints: A Physics-Inspired Approach to Stable and Efficient Generative Modeling(https://arxiv.org/abs/2502.08150)
Keywords: generative
Abstract: This paper introduces Force Matching (ForM), a novel framework for generative modeling that represents an initial exploration into leveraging special relativistic mechanics to enhance the stability of the sampling process. By incorporating the Lorentz factor, ForM imposes a velocity constraint, ensuring that sample velocities remain bounded within a constant limit. This constraint serves as a fundamental mechanism for stabilizing the generative dynamics, leading to a more robust and controlled sampling process. We provide a rigorous theoretical analysis demonstrating that the velocity constraint is preserved throughout the sampling procedure within the ForM framework. To validate the effectiveness of our approach, we conduct extensive empirical evaluations. On the half-moons dataset, ForM significantly outperforms baseline methods, achieving the lowest Euclidean distance loss of 0.714, in contrast to vanilla first-order flow matching (5.853) and first- and second-order flow matching (5.793). Additionally, we perform an ablation study to further investigate the impact of our velocity constraint, reaffirming the superiority of ForM in stabilizing the generative process. The theoretical guarantees and empirical results underscore the potential of integrating special relativity principles into generative modeling. Our findings suggest that ForM provides a promising pathway toward achieving stable, efficient, and flexible generative processes. This work lays the foundation for future advancements in high-dimensional generative modeling, opening new avenues for the application of physical principles in machine learning.
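
To see why a Lorentz factor yields a bounded velocity, recall that for unit mass the relativistic momentum is p = gamma * v with gamma = 1 / sqrt(1 - |v|^2 / c^2); inverting gives v = p / sqrt(1 + |p|^2 / c^2), whose norm stays strictly below c for any finite p. The sketch below illustrates only this textbook relation. Whether ForM parameterizes its velocities this way is an assumption, since the abstract does not give the exact form, and the function names are hypothetical.

```python
import math


def bounded_velocity(p, c=1.0):
    """Map an unconstrained vector p to a velocity whose norm is strictly below c."""
    norm_sq = sum(x * x for x in p)
    # 1 / sqrt(1 + |p|^2 / c^2) shrinks large inputs so that |v| < c.
    scale = 1.0 / math.sqrt(1.0 + norm_sq / (c * c))
    return [x * scale for x in p]


def speed(v):
    """Euclidean norm of a velocity vector."""
    return math.sqrt(sum(x * x for x in v))


# Hypothetical inputs: a moderate and a very large unconstrained vector.
v_small = bounded_velocity([3.0, 4.0], c=1.0)
v_large = bounded_velocity([1000.0, 0.0], c=2.0)
```

Small inputs pass through almost unchanged (the scale is close to 1), while large inputs saturate near the limit c rather than growing without bound.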

Title: DNNs May Determine Major Properties of Their Outputs Early, with Timing Possibly Driven by Bias

Authors: Song Park, Sanghyuk Chun, Byeongho Heo, Dongyoon Han
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.08167
Pdf URL: https://arxiv.org/pdf/2502.08167
Copy Paste: [[2502.08167]] DNNs May Determine Major Properties of Their Outputs Early, with Timing Possibly Driven by Bias(https://arxiv.org/abs/2502.08167)
Keywords: diffusion
Abstract: This paper argues that deep neural networks (DNNs) mostly determine their outputs during the early stages of inference, where biases inherent in the model play a crucial role in shaping this process. We draw a parallel between this phenomenon and human decision-making, which often relies on fast, intuitive heuristics. Using diffusion models (DMs) as a case study, we demonstrate that DNNs often make early-stage decisions influenced by the type and extent of bias in their design and training. Our findings offer a new perspective on bias mitigation, efficient inference, and the interpretation of machine learning systems. By identifying the temporal dynamics of decision-making in DNNs, this paper aims to inspire further discussion and research within the machine learning community.

Title: ActiveSSF: An Active-Learning-Guided Self-Supervised Framework for Long-Tailed Megakaryocyte Classification

Authors: Linghao Zhuang, Ying Zhang, Gege Yuan, Xingyue Zhao, Zhiping Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08200
Pdf URL: https://arxiv.org/pdf/2502.08200
Copy Paste: [[2502.08200]] ActiveSSF: An Active-Learning-Guided Self-Supervised Framework for Long-Tailed Megakaryocyte Classification(https://arxiv.org/abs/2502.08200)
Keywords: self-supervised
Abstract: Precise classification of megakaryocytes is crucial for diagnosing myelodysplastic syndromes. Although self-supervised learning has shown promise in medical image analysis, its application to classifying megakaryocytes in stained slides faces three main challenges: (1) pervasive background noise that obscures cellular details, (2) a long-tailed distribution that limits data for rare subtypes, and (3) complex morphological variations leading to high intra-class variability. To address these issues, we propose the ActiveSSF framework, which integrates active learning with self-supervised pretraining. Specifically, our approach employs Gaussian filtering combined with K-means clustering and HSV analysis (augmented by clinical prior knowledge) for accurate region-of-interest extraction; an adaptive sample selection mechanism that dynamically adjusts similarity thresholds to mitigate class imbalance; and prototype clustering on labeled samples to overcome morphological complexity. Experimental results on clinical megakaryocyte datasets demonstrate that ActiveSSF not only achieves state-of-the-art performance but also significantly improves recognition accuracy for rare subtypes. Moreover, the integration of these advanced techniques further underscores the practical potential of ActiveSSF in clinical settings. To foster further research, the code and datasets will be publicly released in the future.

Title: Equivariant Masked Position Prediction for Efficient Molecular Representation

Authors: Junyi An, Chao Qu, Yun-Fei Shi, XinHao Liu, Qianwei Tang, Fenglei Cao, Yuan Qi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.08209
Pdf URL: https://arxiv.org/pdf/2502.08209
Copy Paste: [[2502.08209]] Equivariant Masked Position Prediction for Efficient Molecular Representation(https://arxiv.org/abs/2502.08209)
Keywords: self-supervised
Abstract: Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns regarding GNNs' ability to effectively capture the fundamental principles of physics and chemistry, which constrains their generalization capabilities. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute masking techniques, EMPP formulates a nuanced position prediction task that is better defined and enhances the learning of quantum mechanical features. EMPP also bypasses the approximation of the Gaussian mixture distribution commonly used in denoising methods, allowing for more accurate acquisition of physical properties. Experimental results indicate that EMPP significantly enhances the performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches. Our code is released in this https URL.

Title: FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis

Authors: Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08244
Pdf URL: https://arxiv.org/pdf/2502.08244
Copy Paste: [[2502.08244]] FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis(https://arxiv.org/abs/2502.08244)
Keywords: diffusion
Abstract: This paper presents FloVD, a novel optical-flow-based video diffusion model for camera-controllable video generation. FloVD leverages optical flow maps to represent motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.

Title: UniCoRN: Unified Commented Retrieval Network with LMMs

Authors: Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08254
Pdf URL: https://arxiv.org/pdf/2502.08254
Copy Paste: [[2502.08254]] UniCoRN: Unified Commented Retrieval Network with LMMs(https://arxiv.org/abs/2502.08254)
Keywords: generative
Abstract: Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer more complex visual questions in language, but without the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generating an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.

Title: GenIAS: Generator for Instantiating Anomalies in time Series

Authors: Zahra Zamanzadeh Darban, Qizhou Wang, Geoffrey I. Webb, Shirui Pan, Charu C. Aggarwal, Mahsa Salehi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.08262
Pdf URL: https://arxiv.org/pdf/2502.08262
Copy Paste: [[2502.08262]] GenIAS: Generator for Instantiating Anomalies in time Series(https://arxiv.org/abs/2502.08262)
Keywords: generative, anomaly
Abstract: A recent and promising approach for building time series anomaly detection (TSAD) models is to inject synthetic samples of anomalies within real data sets. The existing injection mechanisms have significant limitations - most of them rely on ad hoc, hand-crafted strategies which fail to capture the natural diversity of anomalous patterns, or are restricted to univariate time series settings. To address these challenges, we design a generative model for TSAD using a variational autoencoder, which is referred to as a Generator for Instantiating Anomalies in Time Series (GenIAS). GenIAS is designed to produce diverse and realistic synthetic anomalies for TSAD tasks. By employing a novel learned perturbation mechanism in the latent space and injecting the perturbed patterns in different segments of time series, GenIAS can generate anomalies with greater diversity and varying scales. Further, guided by a new triplet loss function, which uses a min-max margin and a new variance-scaling approach to further enforce the learning of compact normal patterns, GenIAS ensures that anomalies are distinct from normal samples while remaining realistic. The approach is effective for both univariate and multivariate time series. We demonstrate the diversity and realism of the generated anomalies. Our extensive experiments demonstrate that GenIAS - when integrated into a TSAD task - consistently outperforms seventeen traditional and deep anomaly detection models, thereby highlighting the potential of generative models for time series anomaly generation.

Title: HDT: Hierarchical Discrete Transformer for Multivariate Time Series Forecasting

Authors: Shibo Feng, Peilin Zhao, Liu Liu, Pengcheng Wu, Zhiqi Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.08302
Pdf URL: https://arxiv.org/pdf/2502.08302
Copy Paste: [[2502.08302]] HDT: Hierarchical Discrete Transformer for Multivariate Time Series Forecasting(https://arxiv.org/abs/2502.08302)
Keywords: generative
Abstract: Generative models have gained significant attention in multivariate time series forecasting (MTS), particularly due to their ability to generate high-fidelity samples. Forecasting the probability distribution of multivariate time series is a challenging yet practical task. Although some recent attempts have been made to handle this task, two major challenges persist: 1) some existing generative methods underperform in high-dimensional multivariate time series forecasting and are hard to scale to higher dimensions; 2) the inherent high-dimensional multivariate attributes constrain the forecasting lengths of existing generative models. In this paper, we point out that discrete token representations can model high-dimensional MTS with faster inference time, and that forecasting the target with its own long-term trends can extend the forecasting length with high accuracy. Motivated by this, we propose a vector quantized framework called Hierarchical Discrete Transformer (HDT) that models time series into discrete token representations with an l2-normalization-enhanced vector quantization strategy, in which we transform MTS forecasting into discrete token generation. To address the limitations of generative models in long-term forecasting, we propose a hierarchical discrete Transformer. This model captures the discrete long-term trend of the target at the low level and leverages this trend as a condition to generate the discrete representation of the target at the high level, which introduces the features of the target itself to extend the forecasting length in high-dimensional MTS. Extensive experiments on five popular MTS datasets verify the effectiveness of our proposed method.
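
The general idea of vector quantization with l2 normalization can be sketched as a nearest-codebook lookup on the unit sphere, where the closest entry is the one with the highest cosine similarity. This is a generic illustration under that assumption and not HDT's actual quantizer; the codebook, function names, and inputs are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def quantize(vector, codebook):
    """Return the index of the codebook entry with the highest cosine similarity."""
    q = l2_normalize(vector)
    best_idx, best_sim = 0, -float("inf")
    for idx, code in enumerate(codebook):
        c = l2_normalize(code)
        sim = sum(a * b for a, b in zip(q, c))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx


# Hypothetical 2-D codebook with three entries.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token_a = quantize([0.2, 5.0], codebook)
token_b = quantize([-3.0, 0.1], codebook)
```

Because both the input and the codes are normalized, only direction matters: an input of any magnitude maps to the token whose code points the same way.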

Title: Screener: Self-supervised Pathology Segmentation Model for 3D Medical Images

Authors: Mikhail Goncharov, Eugenia Soboleva, Mariia Donskova, Ivan Oseledets, Marina Munkhoeva, Maxim Panov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08321
Pdf URL: https://arxiv.org/pdf/2502.08321
Copy Paste: [[2502.08321]] Screener: Self-supervised Pathology Segmentation Model for 3D Medical Images(https://arxiv.org/abs/2502.08321)
Keywords: self-supervised, anomaly
Abstract: Accurate segmentation of all pathological findings in 3D medical images remains a significant challenge, as supervised models are limited to detecting only the few pathology classes annotated in existing datasets. To address this, we frame pathology segmentation as an unsupervised visual anomaly segmentation (UVAS) problem, leveraging the inherent rarity of pathological patterns compared to healthy ones. We enhance the existing density-based UVAS framework with two key innovations: (1) dense self-supervised learning (SSL) for feature extraction, eliminating the need for supervised pre-training, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. Trained on over 30,000 unlabeled 3D CT volumes, our model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans with diverse pathologies. Code and pre-trained models will be made publicly available.

Title: Foundation Models in Computational Pathology: A Review of Challenges, Opportunities, and Impact

Authors: Mohsin Bilal, Aadam, Manahil Raza, Youssef Altherwy, Anas Alsuhaibani, Abdulrahman Abduljabbar, Fahdah Almarshad, Paul Golding, Nasir Rajpoot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08333
Pdf URL: https://arxiv.org/pdf/2502.08333
Copy Paste: [[2502.08333]] Foundation Models in Computational Pathology: A Review of Challenges, Opportunities, and Impact(https://arxiv.org/abs/2502.08333)
Keywords: self-supervised, foundation model, generative
Abstract: From self-supervised, vision-only models to contrastive visual-language frameworks, computational pathology has rapidly evolved in recent years. Generative AI "co-pilots" now demonstrate the ability to mine subtle, sub-visual tissue cues across the cellular-to-pathology spectrum, generate comprehensive reports, and respond to complex user queries. The scale of data has surged dramatically, growing from tens to millions of multi-gigapixel tissue images, while the number of trainable parameters in these models has risen to several billion. The critical question remains: how will this new wave of generative and multi-purpose AI transform clinical diagnostics? In this article, we explore the true potential of these innovations and their integration into clinical practice. We review the rapid progress of foundation models in pathology and clarify their applications and significance. More precisely, we examine the very definition of foundational models, identifying what makes them foundational, general, or multipurpose, and assess their impact on computational pathology. Additionally, we address the unique challenges associated with their development and evaluation. These models have demonstrated exceptional predictive and generative capabilities, but establishing global benchmarks is crucial to enhancing evaluation standards and fostering their widespread clinical adoption. In computational pathology, the broader impact of frontier AI ultimately depends on widespread adoption and societal acceptance. While direct public exposure is not strictly necessary, it remains a powerful tool for dispelling misconceptions, building trust, and securing regulatory support.

Title: Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.08363
Pdf URL: https://arxiv.org/pdf/2502.08363
Copy Paste: [[2502.08363]] Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding(https://arxiv.org/abs/2502.08363)
Keywords: generative
Abstract: The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
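
The contrast with top-k drawn in the abstract can be illustrated on a single row of attention scores: each score is compared against a fixed threshold on its own, so no sort or search over the row is needed to decide what to keep. The sketch below is an assumption-laden illustration, not the paper's method: the threshold value is hypothetical (the paper calibrates it), and renormalizing the softmax over the surviving entries merely stands in for the compensation techniques the abstract mentions without detailing.

```python
import math


def threshold_attention(scores, theta):
    """Softmax over one row of attention scores, keeping only entries >= theta."""
    # Element-wise pruning decision: no ranking of the row is required.
    kept = [i for i, s in enumerate(scores) if s >= theta]
    if not kept:
        # Fall back to the single largest score so the row stays a valid distribution.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    m = max(scores[i] for i in kept)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in kept}
    total = sum(exps.values())
    weights = [0.0] * len(scores)
    for i, e in exps.items():
        weights[i] = e / total
    return weights


# Hypothetical pre-softmax scores for one query and a hypothetical threshold.
weights = threshold_attention([2.0, -1.0, 0.5, 3.0], theta=1.0)
```

Positions with zero weight contribute nothing to the output, so the corresponding rows of the value matrix never need to be read.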

Title: A Survey on Pre-Trained Diffusion Model Distillations

Authors: Xuhui Fan, Zhangkai Wu, Hongyu Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.08364
Pdf URL: https://arxiv.org/pdf/2502.08364
Copy Paste: [[2502.08364]] A Survey on Pre-Trained Diffusion Model Distillations(https://arxiv.org/abs/2502.08364)
Keywords: diffusion, generative
Abstract: Diffusion Models (DMs) have emerged as the dominant approach in Generative Artificial Intelligence (GenAI), owing to their remarkable performance in tasks such as text-to-image synthesis. However, practical DMs, such as stable diffusion, are typically trained on massive datasets and thus usually require large storage. At the same time, many steps may be required, i.e., recursively evaluating the trained neural network, to generate a high-quality image, which results in significant computational costs during sample generation. As a result, distillation methods on pre-trained DMs have become widely adopted practices to develop smaller, more efficient models capable of rapid, few-step generation in low-resource environments. Because these distillation methods have been developed from different perspectives, there is an urgent need for a systematic survey, particularly from a methodological perspective. In this survey, we review distillation methods through three aspects: output loss distillation, trajectory distillation and adversarial distillation. We also discuss current challenges and outline future research directions in the conclusion.

Title: One-Shot Federated Learning with Classifier-Free Diffusion Models

Authors: Obaidullah Zaland, Shutong Jin, Florian T. Pokorny, Monowar Bhuyan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.08488
Pdf URL: https://arxiv.org/pdf/2502.08488
Copy Paste: [[2502.08488]] One-Shot Federated Learning with Classifier-Free Diffusion Models(https://arxiv.org/abs/2502.08488)
Keywords: diffusion, foundation model
Abstract: Federated learning (FL) enables collaborative learning without data centralization but introduces significant communication costs due to multiple communication rounds between clients and the server. One-shot federated learning (OSFL) addresses this by forming a global model with a single communication round, often relying on the server's model distillation or auxiliary dataset generation - often through pre-trained diffusion models (DMs). Existing DM-assisted OSFL methods, however, typically employ classifier-guided DMs, which require training auxiliary classifier models at each client, introducing additional computation overhead. This work introduces OSCAR (One-Shot Federated Learning with Classifier-Free Diffusion Models), a novel OSFL approach that eliminates the need for auxiliary models. OSCAR uses foundation models to devise category-specific data representations at each client, seamlessly integrated into a classifier-free diffusion model pipeline for server-side data generation. OSCAR is a simple yet cost-effective OSFL approach that outperforms the state-of-the-art on four benchmarking datasets while reducing the communication load by at least 99%.

Title: Explanation based In-Context Demonstrations Retrieval for Multilingual Grammatical Error Correction

Authors: Wei Li, Wen Luo, Guangyue Peng, Houfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.08507
Pdf URL: https://arxiv.org/pdf/2502.08507
Copy Paste: [[2502.08507]] Explanation based In-Context Demonstrations Retrieval for Multilingual Grammatical Error Correction(https://arxiv.org/abs/2502.08507)
Keywords: in-context
Abstract: Grammatical error correction (GEC) aims to correct grammatical, spelling, and semantic errors in natural language text. With the growth of large language models (LLMs), direct text generation has gradually become the focus of GEC methods, and few-shot in-context learning presents a cost-effective solution. However, selecting effective in-context examples remains challenging, as the similarity between input texts does not necessarily correspond to similar grammatical error patterns. In this paper, we propose a novel retrieval method based on natural language grammatical error explanations (GEE) to address this issue. Our method retrieves suitable few-shot demonstrations by matching the GEE of the test input with that of pre-constructed database samples, where explanations for erroneous samples are generated by LLMs. We conducted multilingual GEC few-shot experiments on both major open-source and closed-source LLMs. Experiments across five languages show that our method outperforms existing semantic and BM25-based retrieval techniques, without requiring additional training or language adaptation. This also suggests that matching error patterns is key to selecting examples.

Title: FedMHO: Heterogeneous One-Shot Federated Learning Towards Resource-Constrained Edge Devices

Authors: Dezhong Yao, Yuexin Shi, Tongtong Liu, Zhiqiang Xu
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2502.08518
Pdf URL: https://arxiv.org/pdf/2502.08518
Copy Paste: [[2502.08518]] FedMHO: Heterogeneous One-Shot Federated Learning Towards Resource-Constrained Edge Devices(https://arxiv.org/abs/2502.08518)
Keywords: generative
Abstract: Federated Learning (FL) is increasingly adopted in edge computing scenarios, where a large number of heterogeneous clients operate under constrained or sufficient resources. The iterative training process in conventional FL introduces significant computation and communication overhead, which is unfriendly for resource-constrained edge devices. One-shot FL has emerged as a promising approach to mitigate communication overhead, and model-heterogeneous FL solves the problem of diverse computing resources across clients. However, existing methods face challenges in effectively managing model-heterogeneous one-shot FL, often leading to unsatisfactory global model performance or reliance on auxiliary datasets. To address these challenges, we propose a novel FL framework named FedMHO, which leverages deep classification models on resource-sufficient clients and lightweight generative models on resource-constrained devices. On the server side, FedMHO involves a two-stage process that includes data generation and knowledge fusion. Furthermore, we introduce FedMHO-MD and FedMHO-SD to mitigate the knowledge-forgetting problem during the knowledge fusion stage, and an unsupervised data optimization solution to improve the quality of synthetic samples. Comprehensive experiments demonstrate the effectiveness of our methods, as they outperform state-of-the-art baselines in various experimental setups.

Title: LLMs can implicitly learn from mistakes in-context

Authors: Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Yi Chern Tan, Marek Rei, Max Bartolo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.08550
Pdf URL: https://arxiv.org/pdf/2502.08550
Copy Paste: [[2502.08550]] LLMs can implicitly learn from mistakes in-context(https://arxiv.org/abs/2502.08550)
Keywords: in-context
Abstract: Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are scored equally as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
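The prompting setup described above can be sketched as a small prompt builder that shows incorrect and correct answers side by side with no rationale. This is only an illustration under assumptions: the field labels, the `build_prompt` function, and the example questions are hypothetical and are not taken from the paper.

```python
def build_prompt(examples, new_question):
    """Assemble a few-shot prompt that pairs each incorrect answer with the
    correct one and deliberately omits any explanation of the mistake."""
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Incorrect answer: {ex['incorrect']}\n"
            f"Correct answer: {ex['correct']}\n"
        )
    # The model is asked to continue from the final "Correct answer:" cue.
    parts.append(f"Question: {new_question}\nCorrect answer:")
    return "\n".join(parts)


# Assumed toy examples, for illustration only.
examples = [
    {"question": "What is 7 + 6?", "incorrect": "12", "correct": "13"},
    {"question": "What is 9 * 3?", "incorrect": "28", "correct": "27"},
]

prompt = build_prompt(examples, "What is 8 + 5?")
```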

Title: Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

Authors: Shixiang Tang, Yizhou Wang, Lu Chen, Yuan Wang, Sida Peng, Dan Xu, Wanli Ouyang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2502.08556
Pdf URL: https://arxiv.org/pdf/2502.08556
Copy Paste: [[2502.08556]] Human-Centric Foundation Models: Perception, Generation and Agentic Modeling(https://arxiv.org/abs/2502.08556)
Keywords: foundation model
Abstract: Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs), inspired by the success of generalist models such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding. (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content. (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis. (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiment modeling.

Title: Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion

Authors: Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.08560
Pdf URL: https://arxiv.org/pdf/2502.08560
Copy Paste: [[2502.08560]] Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion(https://arxiv.org/abs/2502.08560)
Keywords: diffusion
Abstract: The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.

Title: Ultrasound Image Generation using Latent Diffusion Models

Authors: Benoit Freiche, Anthony El-Khoury, Ali Nasiri-Sarvi, Mahdi S. Hosseini, Damien Garcia, Adrian Basarab, Mathieu Boily, Hassan Rivaz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08580
Pdf URL: https://arxiv.org/pdf/2502.08580
Copy Paste: [[2502.08580]] Ultrasound Image Generation using Latent Diffusion Models(https://arxiv.org/abs/2502.08580)
Keywords: diffusion
Abstract: Diffusion models for image generation have been a subject of increasing interest due to their ability to generate diverse, high-quality images. Image generation has immense potential in medical imaging because open-source medical images are difficult to obtain compared to natural images, especially for rare conditions. The generated images can be used later to train classification and segmentation models. In this paper, we propose simulating realistic ultrasound (US) images by successive fine-tuning of large diffusion models on different publicly available databases. To do so, we fine-tuned Stable Diffusion, a state-of-the-art latent diffusion model, on BUSI (Breast US Images), an ultrasound breast image dataset. We successfully generated high-quality US images of the breast using simple prompts that specify the organ and pathology, which appeared realistic to three experienced US scientists and a US radiologist. Additionally, we provided user control by conditioning the model with segmentations through ControlNet. We will release the source code at this http URL to enable fast US image generation for the scientific community.

Title: Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Authors: Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Anyi Rao, Jiaqi Wang, Li Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08590
Pdf URL: https://arxiv.org/pdf/2502.08590
Copy Paste: [[2502.08590]] Light-A-Video: Training-free Video Relighting via Progressive Light Fusion(https://arxiv.org/abs/2502.08590)
Keywords: diffusion
Abstract: Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: this https URL.
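The linear blending step mentioned above can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's Progressive Light Fusion: the abstract does not specify the weight schedule or where in the pipeline blending happens, so the linear ramp in `progressive_weights`, the flat pixel lists, and both function names are hypothetical.

```python
def blend(source, relit, weight):
    """Linear blend of two appearances: weight 0 gives the source,
    weight 1 gives the relit target."""
    return [(1.0 - weight) * s + weight * r for s, r in zip(source, relit)]


def progressive_weights(num_steps):
    """Assumed schedule: blending weight rises linearly from 0 to 1."""
    if num_steps == 1:
        return [1.0]
    return [i / (num_steps - 1) for i in range(num_steps)]


# Toy "frames" as flat lists of pixel intensities (assumed data).
source = [0.2, 0.4, 0.6]
relit = [0.8, 0.6, 0.4]

# Each entry moves the appearance a little further toward the relit target,
# so illumination changes gradually instead of jumping.
trajectory = [blend(source, relit, w) for w in progressive_weights(5)]
```

Because each blend is linear in the two inputs, the intermediate appearances stay between the source and the relit target, which is what makes a gradual schedule produce smooth transitions.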

Title: Enhancing Diffusion Models Efficiency by Disentangling Total-Variance and Signal-to-Noise Ratio

Authors: Khaled Kahouli, Winfried Ripken, Stefan Gugler, Oliver T. Unke, Klaus-Robert Müller, Shinichi Nakajima
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.08598
Pdf URL: https://arxiv.org/pdf/2502.08598
Copy Paste: [[2502.08598]] Enhancing Diffusion Models Efficiency by Disentangling Total-Variance and Signal-to-Noise Ratio(https://arxiv.org/abs/2502.08598)
Keywords: diffusion
Abstract: The long sampling time of diffusion models remains a significant bottleneck, which can be mitigated by reducing the number of diffusion time steps. However, the quality of samples with fewer steps is highly dependent on the noise schedule, i.e., the specific manner in which noise is introduced and the signal is reduced at each step. Although prior work has improved upon the original variance-preserving and variance-exploding schedules, these approaches $\textit{passively}$ adjust the total variance, without direct control over it. In this work, we propose a novel total-variance/signal-to-noise-ratio disentangled (TV/SNR) framework, where TV and SNR can be controlled independently. Our approach reveals that different existing schedules, where the TV explodes exponentially, can be $\textit{improved}$ by setting a constant TV schedule while preserving the same SNR schedule. Furthermore, generalizing the SNR schedule of optimal transport flow matching significantly improves the performance in molecular structure generation, achieving few-step generation of stable molecules. A similar tendency is observed in image generation, where our approach with a uniform diffusion time grid performs comparably to the highly tailored EDM sampler.
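One way to see how TV and SNR can be set independently is to work with a forward process of the form x_t = a_t * x_0 + b_t * eps. The sketch below assumes unit-variance data, TV = a^2 + b^2, and SNR = a^2 / b^2; these definitions are common conventions and are an assumption here, not quoted from the paper, and the function names are hypothetical.

```python
import math


def coefficients(tv, snr):
    """Solve a^2 + b^2 = tv and a^2 / b^2 = snr for the signal scale a
    and the noise scale b (both taken non-negative)."""
    a = math.sqrt(tv * snr / (1.0 + snr))
    b = math.sqrt(tv / (1.0 + snr))
    return a, b


def tv_and_snr(a, b):
    """Recover total variance and signal-to-noise ratio from (a, b)."""
    return a * a + b * b, (a * a) / (b * b)


# A variance-exploding style step: signal kept at 1, noise scale sigma.
sigma = 3.0
ve_tv, ve_snr = tv_and_snr(1.0, sigma)  # TV grows with sigma, SNR = 1 / sigma^2

# Constant-TV counterpart: keep the same SNR but pin the total variance to 1.
a_const, b_const = coefficients(1.0, ve_snr)
new_tv, new_snr = tv_and_snr(a_const, b_const)
```

Under these assumptions any (TV, SNR) pair maps to exactly one non-negative (a, b) pair, so a schedule's SNR can be kept while its total variance is replaced by a constant.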

Title: CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection

Authors: Karish Grover, Geoffrey J. Gordon, Christos Faloutsos
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.08605
Pdf URL: https://arxiv.org/pdf/2502.08605
Copy Paste: [[2502.08605]] CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection(https://arxiv.org/abs/2502.08605)
Keywords: anomaly
Abstract: Does the intrinsic curvature of complex networks hold the key to unveiling graph anomalies that conventional approaches overlook? Reconstruction-based graph anomaly detection (GAD) methods overlook such geometric outliers, focusing only on structural and attribute-level anomalies. To this end, we propose CurvGAD - a mixed-curvature graph autoencoder that introduces the notion of curvature-based geometric anomalies. CurvGAD introduces two parallel pipelines for enhanced anomaly interpretability: (1) Curvature-equivariant geometry reconstruction, which focuses exclusively on reconstructing the edge curvatures using a mixed-curvature, Riemannian encoder and Gaussian kernel-based decoder; and (2) Curvature-invariant structure and attribute reconstruction, which decouples structural and attribute anomalies from geometric irregularities by regularizing graph curvature under discrete Ollivier-Ricci flow, thereby isolating the non-geometric anomalies. By leveraging curvature, CurvGAD refines the existing anomaly classifications and identifies new curvature-driven anomalies. Extensive experimentation over 10 real-world datasets (both homophilic and heterophilic) demonstrates an improvement of up to 6.5% over state-of-the-art GAD methods.

Title: Continuous Cardiac Arrest Prediction in ICU using PPG Foundation Model

Authors: Saurabh Kataria, Ran Xiao, Timothy Ruchti, Matthew Clark, Jiaying Lu, Randall J. Lee, Jocelyn Grunwell, Xiao Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.08612
Pdf URL: https://arxiv.org/pdf/2502.08612
Copy Paste: [[2502.08612]] Continuous Cardiac Arrest Prediction in ICU using PPG Foundation Model(https://arxiv.org/abs/2502.08612)
Keywords: foundation model
Abstract: Non-invasive patient monitoring for tracking and predicting adverse acute health events is an emerging area of research. We pursue in-hospital cardiac arrest (IHCA) prediction using only single-channel finger photoplethysmography (PPG) signals. Our proposed two-stage model, Feature Extractor-Aggregator Network (FEAN), leverages powerful representations from pre-trained PPG foundation models (PPG-GPT of size up to 1 billion) stacked with sequential classification models. We propose two FEAN variants ("1H", "FH"), which use the latest one-hour and (max) 24-hour history, respectively, to make decisions. Our study is the first to present IHCA prediction results in ICU patients using only unimodal (continuous PPG signal) waveform deep representations. With our best model, we obtain an average AUROC of 0.79 over the 24-hour prediction window before CA event onset, with performance peaking at 0.82 one hour before CA. We also provide a comprehensive analysis of our model through architectural tuning and PaCMAP visualization of patient health trajectory in latent space.

Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08639
Pdf URL: https://arxiv.org/pdf/2502.08639
Copy Paste: [[2502.08639]] CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation(https://arxiv.org/abs/2502.08639)
Keywords: diffusion
Abstract: In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to give users controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring that it generates the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves prominent 3D-aware text-to-video generation. Project page: this https URL.

Title: SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation

Authors: Ellie Arar, Yarden Frenkel, Daniel Cohen-Or, Ariel Shamir, Yael Vinker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.08642
Pdf URL: https://arxiv.org/pdf/2502.08642
Copy Paste: [[2502.08642]] SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation(https://arxiv.org/abs/2502.08642)
Keywords: diffusion
Abstract: Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.