2025-03-12

Title: Is Pre-training Applicable to the Decoder for Dense Prediction?

Authors: Chao Ning, Wanshui Gan, Weihao Xuan, Naoto Yokoya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07637
Pdf URL: https://arxiv.org/pdf/2503.07637
Copy Paste: [[2503.07637]] Is Pre-training Applicable to the Decoder for Dense Prediction?(https://arxiv.org/abs/2503.07637)
Keywords: self-supervised
Abstract: Pre-trained encoders are widely employed in dense prediction tasks for their capability to effectively extract visual features from images. The decoder subsequently processes these features to generate pixel-level predictions. However, due to structural differences and variations in input data, only encoders benefit from pre-learned representations from vision benchmarks such as image classification and self-supervised learning, while decoders are typically trained from scratch. In this paper, we introduce $\times$Net, which facilitates a "pre-trained encoder $\times$ pre-trained decoder" collaboration through three innovative designs. $\times$Net enables the direct utilization of pre-trained models within the decoder, integrating pre-learned representations into the decoding process to enhance performance in dense prediction tasks. By simply coupling the pre-trained encoder and pre-trained decoder, $\times$Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task-specific algorithms. Despite its streamlined design, $\times$Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation, achieving state-of-the-art performance particularly in monocular depth estimation.

Title: BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification

Authors: Jing Zhang, Xiaowei Yu, Tong Chen, Chao Cao, Mingheng Chen, Yan Zhuang, Yanjun Lyu, Lu Zhang, Li Su, Tianming Liu, Dajiang Zhu
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.07640
Pdf URL: https://arxiv.org/pdf/2503.07640
Copy Paste: [[2503.07640]] BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification(https://arxiv.org/abs/2503.07640)
Keywords: generative
Abstract: Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer's disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing "neural-level networks". Our work represents a pioneering effort in modeling a system-level artificial neural network, called BrainNet-MoE, for brain modeling and diagnosis. Inspired by the brain's hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-networks under different conditions. A disease gate mechanism guides the specialization of expert groups, while a transformer layer enables communication between all sub-networks, generating a comprehensive whole-brain representation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.

Title: The day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model

Authors: Changgang Wang, Wei Liu, Yu Cao, Dong Liang, Yang Li, Jingshan Mo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07648
Pdf URL: https://arxiv.org/pdf/2503.07648
Copy Paste: [[2503.07648]] The day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model(https://arxiv.org/abs/2503.07648)
Keywords: diffusion, generative
Abstract: In the context of the rising share of new energy generation, accurately generating new energy output scenarios is crucial for day-ahead power system scheduling. Deep learning-based scenario generation methods can address this need, but their black-box nature raises concerns about interpretability. To tackle this issue, this paper introduces a method for day-ahead new energy scenario generation based on an improved conditional generative diffusion model. This method is built on the theoretical framework of Markov chains and variational inference. It first transforms historical data into pure noise through a diffusion process, then uses conditional information to guide the denoising process, ultimately generating scenarios that satisfy the conditional distribution. Additionally, the noise table is improved to a cosine form, enhancing the quality of the generated scenarios. When applied to actual wind and solar output data, the results demonstrate that this method effectively generates new energy output scenarios with good adaptability.
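
Illustration (not from the paper): the abstract says only that the noise table is changed to a cosine form and does not give the formula. The sketch below assumes the widely used cosine schedule for diffusion models, alpha_bar(t) = cos^2(((t/T + s)/(1 + s)) * pi/2), with per-step variances clipped at 0.999. The function names and the defaults `s=0.008`, `max_beta=0.999`, and `T=1000` are assumptions chosen for illustration.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level at step t under the (assumed) cosine schedule."""
    return math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Per-step noise variances beta_1..beta_T derived from alpha_bar."""
    betas = []
    for t in range(1, T + 1):
        # beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped for stability
        ratio = cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s)
        betas.append(min(1.0 - ratio, max_beta))
    return betas

betas = cosine_betas(T=1000)  # hypothetical number of diffusion steps
```

Compared with a linear table, this form adds noise slowly at the start and end of the chain, so fewer steps are spent on samples that are already almost pure noise.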

Title: TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster

Authors: Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Y. Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma, Yuriy Nevmyvaka, Dongjin Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07649
Pdf URL: https://arxiv.org/pdf/2503.07649
Copy Paste: [[2503.07649]] TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster(https://arxiv.org/abs/2503.07649)
Keywords: foundation model
Abstract: Recently, Large Language Models (LLMs) and Foundation Models (FMs) have become prevalent for time series forecasting tasks. However, fine-tuning large language models (LLMs) for forecasting enables the adaptation to specific domains but may not generalize well across diverse, unseen datasets. Meanwhile, existing time series foundation models (TSFMs) lack inherent mechanisms for domain adaptation and suffer from limited interpretability, making them suboptimal for zero-shot forecasting. To this end, we present TS-RAG, a retrieval-augmented generation based time series forecasting framework that enhances the generalization capability and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant time series segments from a dedicated knowledge database, incorporating contextual patterns for the given time series query. Next, we develop a learnable Mixture-of-Experts (MoE)-based augmentation module, which dynamically fuses retrieved time series patterns with the TSFM's representation of the input query, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming TSFMs by up to 6.51% across diverse domains and showcasing desired interpretability.

Title: Data Foundations for Large Scale Multimodal Clinical Foundation Models

Authors: Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang
Subjects: cs.LG, cs.AI, cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.07667
Pdf URL: https://arxiv.org/pdf/2503.07667
Copy Paste: [[2503.07667]] Data Foundations for Large Scale Multimodal Clinical Foundation Models(https://arxiv.org/abs/2503.07667)
Keywords: foundation model
Abstract: Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at this https URL.

Title: PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Authors: Kwanyoung Kim, Byeongsu Sim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07677
Pdf URL: https://arxiv.org/pdf/2503.07677
Copy Paste: [[2503.07677]] PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity(https://arxiv.org/abs/2503.07677)
Keywords: diffusion
Abstract: Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that need identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.
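
Illustration (not from the paper): the abstract describes extrapolating query-key correlations using softmax and a sparse counterpart, without giving the formula. The sketch below assumes sparsemax as the sparse counterpart and a linear extrapolation `dense + lam * (sparse - dense)`. The function names and the scale `lam` are hypothetical, and the authors' actual formulation may differ.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sparsemax(z):
    """Row-wise Euclidean projection onto the probability simplex."""
    z_sorted = np.sort(z, axis=-1)[..., ::-1]          # descending order
    k = np.arange(1, z.shape[-1] + 1)
    cumsum = np.cumsum(z_sorted, axis=-1)
    support = 1 + k * z_sorted > cumsum                # entries kept non-zero
    k_z = support.sum(axis=-1, keepdims=True)          # support size per row
    tau = (np.take_along_axis(cumsum, k_z - 1, axis=-1) - 1) / k_z
    return np.maximum(z - tau, 0.0)

def extrapolated_weights(scores, lam=2.0):
    """Assumed form: move from dense weights toward (and past) sparse ones."""
    dense = softmax(scores)
    sparse = sparsemax(scores)
    return dense + lam * (sparse - dense)

def cross_attention(q, k, v, lam=2.0):
    """Scaled dot-product attention using the extrapolated weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return extrapolated_weights(scores, lam) @ v
```

Both weight matrices have rows summing to 1, so the extrapolated rows also sum to 1, although individual entries can become negative when `lam` exceeds 1. Setting `lam=0` recovers ordinary softmax attention, and nothing here requires retraining or extra network evaluations.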

Title: A Time Series Multitask Framework Integrating a Large Language Model, Pre-Trained Time Series Model, and Knowledge Graph

Authors: Shule Hao, Junpeng Bao, Chuncheng Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07682
Pdf URL: https://arxiv.org/pdf/2503.07682
Copy Paste: [[2503.07682]] A Time Series Multitask Framework Integrating a Large Language Model, Pre-Trained Time Series Model, and Knowledge Graph(https://arxiv.org/abs/2503.07682)
Keywords: anomaly
Abstract: Time series analysis is crucial in fields like finance, transportation, and industry. However, traditional models often focus solely on temporal features, limiting their ability to capture underlying information. This paper proposes a novel time series multitask framework, called LTM, which integrates temporal features with textual descriptions to enhance analytical and predictive capabilities. LTM combines a pre-trained time series model, a large language model (LLM), and a knowledge graph to tackle time series tasks, including forecasting, imputation, and anomaly detection. LTM achieves improved performance with few trainable parameters, making it efficient and practical. LTM encodes time series data into patches and enriches user-provided prompts using knowledge graphs to generate enhanced prompts. A novel feature fusion method embeds prompts into each patch encoding, which is processed by a frozen LLM, followed by a feature enhancement module and a time decoder module. During the fine-tuning stage, cosine similarity between prompts and temporal patches is integrated into the loss function to boost performance. Experiments on benchmark datasets show that LTM significantly outperforms existing methods. It provides a robust and versatile solution for time series tasks.

Title: RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories

Authors: Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, Xuefeng Xiao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07699
Pdf URL: https://arxiv.org/pdf/2503.07699
Copy Paste: [[2503.07699]] RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories(https://arxiv.org/abs/2503.07699)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method minimizes sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.

Title: Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Authors: Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07703
Pdf URL: https://arxiv.org/pdf/2503.07703
Copy Paste: [[2503.07703]] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model(https://arxiv.org/abs/2503.07703)
Keywords: diffusion, foundation model
Abstract: Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions and adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances accuracy and richness in image description. In particular, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. In addition, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. It can also be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

Title: SIRE: SE(3) Intrinsic Rigidity Embeddings

Authors: Cameron Smith, Basile Van Hoorick, Vitor Guizilini, Yue Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07739
Pdf URL: https://arxiv.org/pdf/2503.07739
Copy Paste: [[2503.07739]] SIRE: SE(3) Intrinsic Rigidity Embeddings(https://arxiv.org/abs/2503.07739)
Keywords: self-supervised
Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure - highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.

Title: Self-supervised Normality Learning and Divergence Vector-guided Model Merging for Zero-shot Congenital Heart Disease Detection in Fetal Ultrasound Videos

Authors: Pramit Saha, Divyanshu Mishra, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris Papageorghiou, Yuki M. Asano, J. Alison Noble
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07799
Pdf URL: https://arxiv.org/pdf/2503.07799
Copy Paste: [[2503.07799]] Self-supervised Normality Learning and Divergence Vector-guided Model Merging for Zero-shot Congenital Heart Disease Detection in Fetal Ultrasound Videos(https://arxiv.org/abs/2503.07799)
Keywords: self-supervised, anomaly
Abstract: Congenital Heart Disease (CHD) is one of the leading causes of fetal mortality, yet the scarcity of labeled CHD data and strict privacy regulations surrounding fetal ultrasound (US) imaging present significant challenges for the development of deep learning-based models for CHD detection. Centralised collection of large real-world datasets for rare conditions, such as CHD, from large populations requires significant co-ordination and resource. In addition, data governance rules increasingly prevent data sharing between sites. To address these challenges, we introduce, for the first time, a novel privacy-preserving, zero-shot CHD detection framework that formulates CHD detection as a normality modeling problem integrated with model merging. In our framework dubbed Sparse Tube Ultrasound Distillation (STUD), each hospital site first trains a sparse video tube-based self-supervised video anomaly detection (VAD) model on normal fetal heart US clips with self-distillation loss. This enables site-specific models to independently learn the distribution of healthy cases. To aggregate knowledge across the decentralized models while maintaining privacy, we propose a Divergence Vector-Guided Model Merging approach, DivMerge, that combines site-specific models into a single VAD model without data exchange. Our approach preserves domain-agnostic rich spatio-temporal representations, ensuring generalization to unseen CHD cases. We evaluated our approach on real-world fetal US data collected from 5 hospital sites. Our merged model outperformed site-specific models by 23.77% and 30.13% in accuracy and F1-score respectively on external test sets.

Title: TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces

Authors: Guillaume Quétant, Pavlo Molchanov, Slava Voloshynovskiy
Subjects: cs.LG, cs.CV, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2503.07851
Pdf URL: https://arxiv.org/pdf/2503.07851
Copy Paste: [[2503.07851]] TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces(https://arxiv.org/abs/2503.07851)
Keywords: foundation model
Abstract: We present a semi-supervised fine-tuning framework for foundation models that utilises mutual information decomposition to address the challenges of training for a limited amount of labelled data. Our approach derives two distinct lower bounds: i) for the downstream task space, such as classification, optimised using conditional and marginal cross-entropy alongside Kullback-Leibler divergence, and ii) for the latent space representation, regularised and aligned using a contrastive-like decomposition. This fine-tuning strategy retains the pre-trained structure of the foundation model, modifying only a specialised projector module comprising a small transformer and a token aggregation technique. Experiments on several datasets demonstrate significant improvements in classification tasks under extremely low-labelled conditions by effectively leveraging unlabelled data.

Title: CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders

Authors: Jongwon Park, Heesoo Jung, Hogun Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07852
Pdf URL: https://arxiv.org/pdf/2503.07852
Copy Paste: [[2503.07852]] CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders(https://arxiv.org/abs/2503.07852)
Keywords: self-supervised
Abstract: Recent Self-Supervised Learning (SSL) methods encapsulating relational information via masking in Graph Neural Networks (GNNs) have shown promising performance. However, most existing approaches rely on random masking strategies in either feature or graph space, which may fail to capture task-relevant information fully. We posit that this limitation stems from an inability to achieve minimum redundancy between masked and unmasked components while ensuring maximum relevance of both to potential downstream tasks. Conditional Independence (CI) inherently satisfies the minimum redundancy and maximum relevance criteria, but its application typically requires access to downstream labels. To address this challenge, we introduce CIMAGE, a novel approach that leverages Conditional Independence to guide an effective masking strategy within the latent space. CIMAGE utilizes CI-aware latent factor decomposition to generate two distinct contexts, leveraging high-confidence pseudo-labels derived from unsupervised graph clustering. In this framework, the pretext task involves reconstructing the masked second context solely from the information provided by the first context. Our theoretical analysis further supports the superiority of CIMAGE's novel CI-aware masking method by demonstrating that the learned embedding exhibits approximate linear separability, which enables accurate predictions for the downstream task. Comprehensive evaluations across diverse graph benchmarks illustrate the advantage of CIMAGE, with notably higher average rankings on node classification and link prediction tasks. Notably, our proposed model highlights the under-explored potential of CI in enhancing graph SSL methodologies and offers enriched insights for effective graph representation learning.

Title: Video Action Differencing

Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07860
Pdf URL: https://arxiv.org/pdf/2503.07860
Copy Paste: [[2503.07860]] Video Action Differencing(https://arxiv.org/abs/2503.07860)
Keywords: foundation model
Abstract: How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at this https URL and code at this http URL.

Title: Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

Authors: Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07890
Pdf URL: https://arxiv.org/pdf/2503.07890
Copy Paste: [[2503.07890]] Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?(https://arxiv.org/abs/2503.07890)
Keywords: diffusion, self-supervised, foundation model, generative
Abstract: Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models--which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation--remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. Code will be released.

Title: Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Authors: Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M.Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07920
Pdf URL: https://arxiv.org/pdf/2503.07920
Copy Paste: [[2503.07920]] Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia(https://arxiv.org/abs/2503.07920)
Keywords: generative
Abstract: Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.

Title: CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
Subjects: cs.LG, cs.CV, stat.ME
Abstract URL: https://arxiv.org/abs/2503.07938
Pdf URL: https://arxiv.org/pdf/2503.07938
Copy Paste: [[2503.07938]] CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement(https://arxiv.org/abs/2503.07938)
Keywords: generative
Abstract: While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.

Title: STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision

Authors: Hin Wai Lui, Jeffrey L. Krichmar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07939
Pdf URL: https://arxiv.org/pdf/2503.07939
Copy Paste: [[2503.07939]] STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision(https://arxiv.org/abs/2503.07939)
Keywords: generative
Abstract: This paper explores vision-based localization through a biologically-inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves impressive precision, with median deviations of 2.29m (1.37% of environment size) and 4.45m (0.35% of environment size) respectively, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our comprehensive Localization Performance Characteristics (LPC) analysis demonstrates superior performance with the VAE-Transformer achieving an AUC of 0.777 compared to 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state-of-the-art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing methods in cross-view geo-localization. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.

Title: STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications

Authors: Andrew Gao, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07942
Pdf URL: https://arxiv.org/pdf/2503.07942
Copy Paste: [[2503.07942]] STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications(https://arxiv.org/abs/2503.07942)
Keywords: anomaly
Abstract: This paper presents a new method for anomaly detection in automated systems with time and compute sensitive requirements, such as autonomous driving, with unparalleled efficiency. As systems like autonomous driving become increasingly popular, ensuring their safety has become more important than ever. Therefore, this paper focuses on how to quickly and effectively detect various anomalies in the aforementioned systems, with the goal of making them safer and more effective. Many detection systems have been developed with great success under spatial contexts; however, there is still significant room for improvement when it comes to temporal context. While there is substantial work regarding this task, there is minimal work done regarding the efficiency of models and their ability to be applied to scenarios that require real-time inference, i.e., autonomous driving where anomalies need to be detected the moment they are within view. To address this gap, we propose STEAD (Spatio-Temporal Efficient Anomaly Detection), whose backbone is developed using (2+1)D Convolutions and Performer Linear Attention, which ensures computational efficiency without sacrificing performance. When tested on the UCF-Crime benchmark, our base model achieves an AUC of 91.34%, outperforming the previous state-of-the-art, and our fast version achieves an AUC of 88.87%, while having 99.70% fewer parameters and outperforming the previous state-of-the-art as well. The code and pretrained models are made publicly available at this https URL.

Title: Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation

Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Lei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07958
Pdf URL: https://arxiv.org/pdf/2503.07958
Copy Paste: [[2503.07958]] Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation(https://arxiv.org/abs/2503.07958)
Keywords: self-supervised
Abstract: This paper investigates the critical problem of representation similarity evolution during cross-domain transfer learning, with particular focus on understanding why pre-trained models maintain effectiveness when adapted to medical imaging tasks despite significant domain gaps. The study establishes a rigorous problem definition centered on quantifying and analyzing representation similarity trajectories throughout the fine-tuning process, while carefully delineating the scope to encompass both medical image analysis and broader cross-domain adaptation scenarios. Our empirical findings reveal three critical discoveries: the potential existence of high-performance models that preserve both task accuracy and representation similarity to their pre-trained origins; a robust linear correlation between layer-wise similarity metrics and representation quality indicators; and distinct adaptation patterns that differentiate supervised versus self-supervised pre-training paradigms. The proposed similarity space framework not only provides mechanistic insights into knowledge transfer dynamics but also raises fundamental questions about optimal utilization of pre-trained models. These results advance our understanding of neural network adaptation processes while offering practical implications for transfer learning strategies that extend beyond medical imaging applications. The code will be available once accepted.

Title: Recent Advances in Hypergraph Neural Networks

Authors: Murong Yang, Xin-Jian Xu
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2503.07959
Pdf URL: https://arxiv.org/pdf/2503.07959
Copy Paste: [[2503.07959]] Recent Advances in Hypergraph Neural Networks(https://arxiv.org/abs/2503.07959)
Keywords: generative
Abstract: The growing interest in hypergraph neural networks (HGNNs) is driven by their capacity to capture the complex relationships and patterns within hypergraph structured data across various domains, including computer vision, complex networks, and natural language processing. This paper comprehensively reviews recent advances in HGNNs and presents a taxonomy of mainstream models based on their architectures: hypergraph convolutional networks (HGCNs), hypergraph attention networks (HGATs), hypergraph autoencoders (HGAEs), hypergraph recurrent networks (HGRNs), and deep hypergraph generative models (DHGGMs). For each category, we delve into its practical applications, mathematical mechanisms, literature contributions, and open problems. Finally, we discuss some common challenges and promising research directions. This paper aspires to be a helpful resource that provides guidance for future research and applications of HGNNs.

Title: Regulatory DNA sequence Design with Reinforcement Learning

Authors: Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2503.07981
Pdf URL: https://arxiv.org/pdf/2503.07981
Copy Paste: [[2503.07981]] Regulatory DNA sequence Design with Reinforcement Learning(https://arxiv.org/abs/2503.07981)
Keywords: generative
Abstract: Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at this https URL.

Title: DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation

Authors: Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Kyungsu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07982
Pdf URL: https://arxiv.org/pdf/2503.07982
Copy Paste: [[2503.07982]] DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation(https://arxiv.org/abs/2503.07982)
Keywords: diffusion
Abstract: Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG's edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average $+4.4\text{ AP}$ improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by $+1.7\text{ PQ}$. These results demonstrate that DiffEGG's edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at this https URL.

Title: CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction

Authors: Zhiyuan Wu, Xibin Song, Senbo Wang, Weizhe Liu, Jiayu Yang, Ziang Cheng, Shenzhou Chen, Taizhang Shang, Weixuan Sun, Shan Luo, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08005
Pdf URL: https://arxiv.org/pdf/2503.08005
Copy Paste: [[2503.08005]] CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction(https://arxiv.org/abs/2503.08005)
Keywords: diffusion
Abstract: 3D object reconstruction from a single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain as 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle the aforementioned challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.

Title: Exploring Bias in over 100 Text-to-Image Generative Models

Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08012
Pdf URL: https://arxiv.org/pdf/2503.08012
Copy Paste: [[2503.08012]] Exploring Bias in over 100 Text-to-Image Generative Models(https://arxiv.org/abs/2503.08012)
Keywords: foundation model, generative
Abstract: We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development. Keywords: Bias, Ethical AI, Text-to-Image, Generative Models, Open-Source Models

Title: GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals

Authors: Zhaoliang Chen, Cheng Ding, Saurabh Kataria, Runze Yan, Minxiao Wang, Randall Lee, Xiao Hu
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.08015
Pdf URL: https://arxiv.org/pdf/2503.08015
Copy Paste: [[2503.08015]] GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals(https://arxiv.org/abs/2503.08015)
Keywords: foundation model, generative
Abstract: This study introduces a novel application of a Generative Pre-trained Transformer (GPT) model tailored for photoplethysmography (PPG) signals, serving as a foundation model for various downstream tasks. Adapting the standard GPT architecture to suit the continuous characteristics of PPG signals, our approach demonstrates promising results. Our models are pre-trained on our extensive dataset that contains more than 200 million 30s PPG samples. We explored different supervised fine-tuning techniques to adapt our model to downstream tasks, resulting in performance comparable to or surpassing current state-of-the-art (SOTA) methods in tasks like atrial fibrillation detection. A standout feature of our GPT model is its inherent capability to perform generative tasks such as signal denoising effectively, without the need for further fine-tuning. This success is attributed to the generative nature of the GPT framework.

Title: Partial differential equation system for binarization of degraded document images

Authors: Youjin Liu, Yu Wang
Subjects: cs.CV, math.DS
Abstract URL: https://arxiv.org/abs/2503.08017
Pdf URL: https://arxiv.org/pdf/2503.08017
Copy Paste: [[2503.08017]] Partial differential equation system for binarization of degraded document images(https://arxiv.org/abs/2503.08017)
Keywords: diffusion
Abstract: In recent years, partial differential equation (PDE) systems have been successfully applied to the binarization of text images, achieving promising results. Inspired by the DH model and incorporating a novel image modeling approach, this study proposes a new weakly coupled PDE system for degraded text image binarization. In this system, the first equation is designed to estimate the background component, incorporating both diffusion and fidelity terms. The second equation estimates the foreground component and includes diffusion, fidelity, and binarization source terms. The final binarization result is obtained by applying a hard projection to the estimated foreground component. Experimental results on 86 degraded text images demonstrate that the proposed model exhibits significant advantages in handling degraded text images.

Title: Learning to Search Effective Example Sequences for In-Context Learning

Authors: Xiang Gao, Ankita Sinha, Kamalika Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08030
Pdf URL: https://arxiv.org/pdf/2503.08030
Copy Paste: [[2503.08030]] Learning to Search Effective Example Sequences for In-Context Learning(https://arxiv.org/abs/2503.08030)
Keywords: in-context
Abstract: Large language models (LLMs) demonstrate impressive few-shot learning capabilities, but their performance varies widely based on the sequence of in-context examples. Key factors influencing this include the sequence's length, composition, and arrangement, as well as its relation to the specific query. Existing methods often tackle these factors in isolation, overlooking their interdependencies. Moreover, the extensive search space for selecting optimal sequences complicates the development of a holistic approach. In this work, we introduce Beam Search-based Example Sequence Constructor (BESC), a novel method for learning to construct optimal example sequences. BESC addresses all key factors involved in sequence selection by considering them jointly during inference, while incrementally building the sequence. This design enables the use of beam search to significantly reduce the complexity of the search space. Experiments across various datasets and language models show notable improvements in performance.

Title: A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data

Authors: Naomi Baes, Raphaël Merx, Nick Haslam, Ekaterina Vylomova, Haim Dubossarsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08042
Pdf URL: https://arxiv.org/pdf/2503.08042
Copy Paste: [[2503.08042]] A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data(https://arxiv.org/abs/2503.08042)
Keywords: in-context
Abstract: Lexical Semantic Change (LSC) offers insights into cultural and social dynamics. Yet, the validity of methods for measuring kinds of LSC has yet to be established due to the absence of historical benchmark datasets. To address this gap, we develop a novel three-stage evaluation framework that involves: 1) creating a scalable, domain-general methodology for generating synthetic datasets that simulate theory-driven LSC across time, leveraging In-Context Learning and a lexical database; 2) using these datasets to evaluate the effectiveness of various methods; and 3) assessing their suitability for specific dimensions and domains. We apply this framework to simulate changes across key dimensions of LSC (SIB: Sentiment, Intensity, and Breadth) using examples from psychology, and evaluate the sensitivity of selected methods to detect these artificially induced changes. Our findings support the utility of the synthetic data approach, validate the efficacy of tailored methods for detecting synthetic changes in SIB, and reveal that a state-of-the-art LSC model faces challenges in detecting affective dimensions of LSC. This framework provides a valuable tool for dimension- and domain-specific benchmarking and evaluation of LSC methods, with particular benefits for the social sciences.

Title: Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection

Authors: Ying Fu Lim, Jiawen Zhu, Guansong Pang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.08045
Pdf URL: https://arxiv.org/pdf/2503.08045
Copy Paste: [[2503.08045]] Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection(https://arxiv.org/abs/2503.08045)
Keywords: anomaly
Abstract: Log Anomaly Detection (LAD) seeks to identify atypical patterns in log data that are crucial to assessing the security and condition of systems. Although Large Language Models (LLMs) have shown tremendous success in various fields, the use of LLMs in enabling the detection of log anomalies is largely unexplored. This work aims to fill this gap. Due to the prohibitive costs involved in fully fine-tuning LLMs, we explore the use of parameter-efficient fine-tuning techniques (PEFTs) for adapting LLMs to LAD. To have an in-depth exploration of the potential of LLM-driven LAD, we present a comprehensive investigation of leveraging two of the most popular PEFTs -- Low-Rank Adaptation (LoRA) and Representation Fine-tuning (ReFT) -- to tap into three prominent LLMs of varying size, including RoBERTa, GPT-2, and Llama-3, for parameter-efficient LAD. Comprehensive experiments on four public log datasets are performed to reveal important insights into effective LLM-driven LAD in several key perspectives, including the efficacy of these PEFT-based LLM-driven LAD methods, their stability, sample efficiency, robustness w.r.t. unstable logs, and cross-dataset generalization. Code is available at this https URL.

Title: SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models

Authors: Nadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08049
Pdf URL: https://arxiv.org/pdf/2503.08049
Copy Paste: [[2503.08049]] SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models(https://arxiv.org/abs/2503.08049)
Keywords: generative
Abstract: The widespread use of deep learning classifiers necessitates Open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training, to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results, with improvements up to 6% that validate its performance.
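
As a rough illustration of what modelling a feature space as a mixture of von Mises-Fisher (vMF) distributions can look like, the sketch below computes posterior class probabilities for unit-normalized features. It is not SphOR's implementation: the function name `vmf_mixture_posteriors`, the shared concentration `kappa`, the equal mixture weights, and the toy data are all assumptions made for this example.

```python
import numpy as np

def vmf_mixture_posteriors(features, mean_directions, kappa=10.0):
    """Posterior class probabilities under an equal-weight vMF mixture.

    Assumes (for illustration only) one shared concentration `kappa`
    and equal mixture weights for all classes.
    """
    # Project features and class means onto the unit sphere.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    mu = mean_directions / np.linalg.norm(mean_directions, axis=1, keepdims=True)
    # The vMF log-density is kappa * mu^T z + log C_d(kappa). With a shared
    # kappa the normalizer C_d(kappa) is the same for every component, so it
    # cancels in the posterior and only the scaled cosine similarity remains.
    logits = kappa * (z @ mu.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy data: two known classes in 2-D, plus one sample between them.
features = np.array([[2.0, 0.1], [0.0, 3.0], [1.0, 1.0]])
means = np.array([[1.0, 0.0], [0.0, 1.0]])
posteriors = vmf_mixture_posteriors(features, means)
max_confidence = posteriors.max(axis=1)
```

The third toy sample lies exactly between the two class directions, so its posterior is split evenly. Treating a low maximum posterior as a sign of an ambiguous or unknown input is one possible use of such scores, not a rule taken from the paper.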

Title: Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm

Authors: Nadarasar Bahavan, Sanjay Saha, Ken Chen, Sachith Seneviratne, Sanka Rasnayaka, Saman Halgamuge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08055
Pdf URL: https://arxiv.org/pdf/2503.08055
Copy Paste: [[2503.08055]] Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm(https://arxiv.org/abs/2503.08055)
Keywords: generative
Abstract: Facial forgery methods such as deepfakes can be misused for identity manipulation and spreading misinformation. They have evolved alongside advancements in generative AI, leading to new and more sophisticated forgery techniques that diverge from existing 'known' methods. Conventional deepfake detection methods use the closed-set paradigm, thus limiting their applicability to detecting forgeries created using methods that are not part of the training dataset. In this paper, we propose a shift from the closed-set paradigm for deepfake detection. In the open-set paradigm, models are designed not only to identify images created by known facial forgery methods but also to identify and flag those produced by previously unknown methods as 'unknown' and not as unforged/real/unmanipulated. In this paper, we propose an open-set deepfake classification algorithm based on supervised contrastive learning. The open-set paradigm used in our model allows it to function as a more robust tool capable of handling emerging and unseen deepfake techniques, enhancing reliability and confidence, and complementing forensic analysis. In the open-set paradigm, we identify three groups, including the "unknown" group that is considered neither known deepfake nor real. We investigate deepfake open-set classification across three scenarios: classifying deepfakes from unknown methods not as real, distinguishing real images from deepfakes, and classifying deepfakes from known methods, using the FaceForensics++ dataset as a benchmark. Our method achieves state-of-the-art results in the first two tasks and competitive results in the third task.

Title: Seeing Beyond Haze: Generative Nighttime Image Dehazing

Authors: Beibei Lin, Stephen Lin, Robby Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08073
Pdf URL: https://arxiv.org/pdf/2503.08073
Copy Paste: [[2503.08073]] Seeing Beyond Haze: Generative Nighttime Image Dehazing(https://arxiv.org/abs/2503.08073)
Keywords: diffusion, generative
Abstract: Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or completely obscure background information. Existing methods often encounter difficulties due to insufficient background priors and limited generative ability, both essential for handling such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only significantly reduces haze and glow effects but also infers background information in regions where it may be absent. Our approach is developed on two main ideas: gaining strong background priors by adapting image diffusion models to the nighttime dehazing problem, and enhancing generative ability for haze- and glow-obscured scene areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model in a manner that preserves its capacity to generate clean images. The diffusion model is additionally trained on image pairs designed to improve its ability to generate background details and content that are missing in the input image due to haze effects. Since generative models are susceptible to hallucinations, we develop our framework to allow user control over the generative level, balancing visual realism and factual accuracy. Experiments on real-world images demonstrate that BeyondHaze effectively restores visibility in dense nighttime haze.

Title: Degradation Self-Supervised Learning for Lithium-ion Battery Health Diagnostics

Authors: J. C. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08083
Pdf URL: https://arxiv.org/pdf/2503.08083
Copy Paste: [[2503.08083]] Degradation Self-Supervised Learning for Lithium-ion Battery Health Diagnostics(https://arxiv.org/abs/2503.08083)
Keywords: self-supervised
Abstract: Health evaluation for lithium-ion batteries (LIBs) typically relies on constant charging/discharging protocols, often neglecting scenarios involving dynamic current profiles prevalent in electric vehicles. Conventional health indicators for LIBs also depend on the uniformity of measured data, restricting their adaptability to non-uniform conditions. In this study, a novel training strategy for estimating LIB health based on the paradigm of self-supervised learning is proposed. A multiresolution analysis technique, empirical wavelet transform, is utilized to decompose non-stationary voltage signals in the frequency domain. This allows the removal of ineffective components for the health evaluation model. The transformer neural network serves as the model backbone, and a loss function is designed to describe the capacity degradation behavior with the assumption that the degradation in LIBs across most operating conditions is inevitable and irreversible. The results show that the model can learn the aging characteristics by analyzing sequences of voltage and current profiles obtained at various time intervals from the same LIB cell. The proposed method is successfully applied to the Stanford University LIB aging dataset, derived from electric vehicle real driving profiles. Notably, this approach achieves an average correlation coefficient of 0.9 between the evaluated health index and the degradation of actual capacity, demonstrating its efficacy in capturing LIB health degradation. This research highlights the feasibility of training deep neural networks using unlabeled LIB data, offering cost-efficient means and unleashing the potential of the measured information.

Title: PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models

Authors: Kyeongkook Seo, Dong-Jun Han, Jaejun Yoo
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08085
Pdf URL: https://arxiv.org/pdf/2503.08085
Copy Paste: [[2503.08085]] PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models(https://arxiv.org/abs/2503.08085)
Keywords: generative
Abstract: Despite recent advancements in federated learning (FL), the integration of generative models into FL has been limited due to challenges such as high communication costs and unstable training in heterogeneous data environments. To address these issues, we propose PRISM, an FL framework tailored for generative models that ensures (i) stable performance in heterogeneous data distributions and (ii) resource efficiency in terms of communication cost and final model size. The key of our method is to search for an optimal stochastic binary mask for a random network rather than updating the model weights, identifying a sparse subnetwork with high generative performance; i.e., a "strong lottery ticket". By communicating binary masks in a stochastic manner, PRISM minimizes communication overhead. This approach, combined with the utilization of maximum mean discrepancy (MMD) loss and a mask-aware dynamic moving average aggregation method (MADA) on the server side, facilitates stable and strong generative capabilities by mitigating local divergence in FL scenarios. Moreover, thanks to its sparsifying characteristic, PRISM yields a lightweight model without extra pruning or quantization, making it ideal for environments such as edge devices. Experiments on MNIST, FMNIST, CelebA, and CIFAR10 demonstrate that PRISM outperforms existing methods, while maintaining privacy with minimal communication costs. PRISM is the first to successfully generate images under challenging non-IID and privacy-preserving FL environments on complex datasets, where previous methods have struggled.
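
The idea of learning a stochastic binary mask over frozen random weights, instead of updating the weights themselves, can be sketched as follows. This is a generic illustration, not PRISM's code: the per-weight `scores`, the sigmoid parameterization, and the helper `sample_binary_mask` are assumptions, and the MMD loss and MADA aggregation named in the abstract are not shown.

```python
import numpy as np

def sample_binary_mask(scores, rng):
    """Sample a Bernoulli mask with keep-probability sigmoid(score) per weight."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    mask = (rng.random(scores.shape) < probs).astype(np.uint8)
    return mask, probs

rng = np.random.default_rng(0)

# Randomly initialized weights that stay frozen for the whole run.
frozen_weights = rng.standard_normal((4, 3))

# Hypothetical learnable scores; in training these, not the weights, are updated.
scores = np.zeros((4, 3))
scores[0, :] = 20.0    # keep-probability close to 1
scores[1, :] = -20.0   # keep-probability close to 0

mask, probs = sample_binary_mask(scores, rng)

# The sparse subnetwork is the frozen weights with masked-out entries zeroed.
effective_weights = frozen_weights * mask

# A client would only need to send the mask: one bit per weight.
payload_bits = mask.size
```

Because only the mask needs to be communicated, the payload in this sketch is one bit per weight rather than one floating-point value per weight, which is the intuition behind the communication savings described above.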

Title: MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution

Authors: Xinrui Li, Jianlong Wu, Xinchuan Huang, Chong Chen, Weili Guan, Xian-Sheng Hua, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08096
Pdf URL: https://arxiv.org/pdf/2503.08096
Copy Paste: [[2503.08096]] MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution(https://arxiv.org/abs/2503.08096)
Keywords: diffusion
Abstract: Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at different depths and the fine-grained, concrete semantics inherently present in the images themselves. Moreover, relying solely on a single type of guidance further disrupts the consistency of reconstruction. To address these issues, we propose MegaSR, a novel framework that mines customized block-wise semantics and expressive guidance for diffusion-based ISR. Compared to uniform textual semantics, MegaSR enables flexible adaptation to multi-granularity semantic awareness by dynamically incorporating image attributes at each block. Furthermore, we experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance, and propose a multi-stage aggregation strategy to modulate them into the T2I models. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.

Title: ACE: Concept Editing in Diffusion Models without Performance Degradation

Authors: Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08116
Pdf URL: https://arxiv.org/pdf/2503.08116
Copy Paste: [[2503.08116]] ACE: Concept Editing in Diffusion Models without Performance Degradation(https://arxiv.org/abs/2503.08116)
Keywords: diffusion
Abstract: Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing has been proposed to address these issues, existing methods often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is available at this https URL

Title: Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models

Authors: Weiguo Gao, Ming Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08117
Pdf URL: https://arxiv.org/pdf/2503.08117
Copy Paste: [[2503.08117]] Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models(https://arxiv.org/abs/2503.08117)
Keywords: generative
Abstract: The increasing prevalence of synthetic data in training loops has raised concerns about model collapse, where generative models degrade when trained on their own outputs. While prior work focuses on this self-consuming process, we study an underexplored yet prevalent phenomenon: co-evolving generative models that shape each other's training through iterative feedback. This is common in multimodal AI ecosystems, such as social media platforms, where text models generate captions that guide image models, and the resulting images influence the future adaptation of the text model. We take a first step by analyzing such a system, modeling the text model as a multinomial distribution and the image model as a conditional multi-dimensional Gaussian distribution. Our analysis uncovers three key results. First, when one model remains fixed, the other collapses: a frozen image model causes the text model to lose diversity, while a frozen text model leads to an exponential contraction of image diversity, though fidelity remains bounded. Second, in fully interactive systems, mutual reinforcement accelerates collapse, with image contraction amplifying text homogenization and vice versa, leading to a Matthew effect where dominant texts sustain higher image diversity while rarer texts collapse faster. Third, we analyze stabilization strategies implicitly introduced by real-world external influences. Random corpus injections for text models and user-content injections for image models prevent collapse while preserving both diversity and fidelity. Our theoretical findings are further validated through experiments.

Title: Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Authors: Junzhe Li, Xuerui Qiu, Linrui Xu, Liya Guo, Delin Qu, Tingting Long, Chun Fan, Ming Li
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.08120
Pdf URL: https://arxiv.org/pdf/2503.08120
Copy Paste: [[2503.08120]] Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models(https://arxiv.org/abs/2503.08120)
Keywords: diffusion, generative
Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on $\textbf{coarse}$ facial attribute understanding, with limited capacity to handle $\textbf{fine-grained}$ facial attributes and without addressing generation capabilities. To overcome these limitations, we propose Uni$\textbf{F}^2$ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train Uni$\textbf{F}^2$ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, Uni$\textbf{F}^2$ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on Uni$\textbf{F}^2$ace-130K demonstrate that Uni$\textbf{F}^2$ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.

Title: Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments

Authors: Soonwoo Kwon, Jin-Young Kim, Hyojun Go, Kyungjune Baek
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08122
Pdf URL: https://arxiv.org/pdf/2503.08122
Copy Paste: [[2503.08122]] Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments(https://arxiv.org/abs/2503.08122)
Keywords: diffusion, generative
Abstract: We present a novel study on enhancing the capability of preserving the content in world models, focusing on a property we term World Stability. Recent diffusion-based generative models have advanced the synthesis of immersive and realistic environments that are pivotal for applications such as reinforcement learning and interactive game engines. However, while these models excel in quality and diversity, they often neglect the preservation of previously generated scenes over time--a shortfall that can introduce noise into agent learning and compromise performance in safety-critical settings. In this work, we introduce an evaluation framework that measures world stability by having world models perform a sequence of actions followed by their inverses to return to their initial viewpoint, thereby quantifying the consistency between the starting and ending observations. Our comprehensive assessment of state-of-the-art diffusion-based world models reveals significant challenges in achieving high world stability. Moreover, we investigate several improvement strategies to enhance world stability. Our results underscore the importance of world stability in world modeling and provide actionable insights for future research in this domain.
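Note: the round-trip evaluation described in the abstract above can be sketched on a toy world. This is an illustration under stated assumptions, not the paper's framework: observations are lists of floats, actions are integer shifts whose inverse is the negated shift, and consistency is measured by mean squared difference. All names are hypothetical.

```python
from typing import Callable, List, Sequence

Observation = List[float]


def inverse_actions(actions: Sequence[int]) -> List[int]:
    """Return the actions that undo `actions`: reversed order, each negated."""
    return [-a for a in reversed(actions)]


def round_trip_discrepancy(
    step: Callable[[Observation, int], Observation],
    start: Observation,
    actions: Sequence[int],
) -> float:
    """Apply `actions`, then their inverses, and return the mean squared
    difference between the starting and final observations (0.0 = stable)."""
    obs = list(start)
    for a in list(actions) + inverse_actions(actions):
        obs = step(obs, a)
    return sum((x - y) ** 2 for x, y in zip(start, obs)) / len(start)


def exact_step(obs: Observation, action: int) -> Observation:
    """A perfectly stable toy world: rotate the observation by `action`."""
    k = action % len(obs)
    return obs[k:] + obs[:k]


def lossy_step(obs: Observation, action: int) -> Observation:
    """An unstable toy world: same rotation, but content fades at every step."""
    return [0.9 * x for x in exact_step(obs, action)]
```

With `exact_step` the round trip returns the starting observation exactly, so the discrepancy is zero; with `lossy_step` the scene has drifted by the time the viewpoint returns, and the discrepancy is positive.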

Title: MGHanD: Multi-modal Guidance for authentic Hand Diffusion

Authors: Taehyeon Eum, Jieun Choi, Tae-Kyun Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08133
Pdf URL: https://arxiv.org/pdf/2503.08133
Copy Paste: [[2503.08133]] MGHanD: Multi-modal Guidance for authentic Hand Diffusion(https://arxiv.org/abs/2503.08133)
Keywords: diffusion, generative
Abstract: Diffusion-based methods have achieved significant successes in T2I generation, providing realistic images from text prompts. Despite their capabilities, these models face persistent challenges in generating realistic human hands, often producing images with incorrect finger counts and structurally deformed hands. MGHanD addresses this challenge by applying multi-modal guidance during the inference process. For visual guidance, we employ a discriminator trained on a dataset comprising paired real and generated images with captions, derived from various hand-in-the-wild datasets. We also employ textual guidance with a LoRA adapter, which learns the direction from 'hands' towards more detailed prompts such as 'natural hands' and 'anatomically correct fingers' at the latent level. A cumulative hand mask, which is gradually enlarged over the assigned time steps, is applied to the added guidance, allowing the hand to be refined while maintaining the rich generative capabilities of the pre-trained model. In the experiments, our method achieves superior hand generation quality without any specific conditions or priors. We carry out both quantitative and qualitative evaluations, along with user studies, to showcase the benefits of our approach in producing high-quality hand images.

Title: ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting

Authors: Junfu Guo, Yu Xin, Gaoyi Liu, Kai Xu, Ligang Liu, Ruizhen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08135
Pdf URL: https://arxiv.org/pdf/2503.08135
Copy Paste: [[2503.08135]] ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting(https://arxiv.org/abs/2503.08135)
Keywords: self-supervised
Abstract: We tackle the challenge of concurrent reconstruction at the part level with the RGB appearance and estimation of motion parameters for building digital twins of articulated objects using the 3D Gaussian Splatting (3D-GS) method. With two distinct sets of multi-view imagery, each depicting an object in separate static articulation configurations, we reconstruct the articulated object in 3D Gaussian representations with both appearance and geometry information at the same time. Our approach decouples multiple highly interdependent parameters through a multi-step optimization process, thereby achieving a stable optimization procedure and high-quality outcomes. We introduce ArticulatedGS, a self-supervised, comprehensive framework that autonomously learns to model shapes and appearances at the part level and synchronizes the optimization of motion parameters, all without reliance on 3D supervision, motion cues, or semantic labels. Our experimental results demonstrate that, among comparable methodologies, our approach achieves the best outcomes in terms of part segmentation accuracy, motion estimation accuracy, and visual quality.

Title: FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems

Authors: Jeongsol Kim, Bryan Sangwoo Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08136
Pdf URL: https://arxiv.org/pdf/2503.08136
Copy Paste: [[2503.08136]] FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems(https://arxiv.org/abs/2503.08136)
Keywords: diffusion, generative
Abstract: Flow matching is a recent state-of-the-art framework for generative modeling based on ordinary differential equations (ODEs). While closely related to diffusion models, it provides a more general perspective on generative modeling. Although inverse problem solving has been extensively explored using diffusion models, it has not been rigorously examined within the broader context of flow models. Therefore, here we extend the diffusion inverse solvers (DIS), which perform posterior sampling by combining a denoising diffusion prior with a likelihood gradient, into the flow framework. Specifically, by deriving the flow version of Tweedie's formula, we decompose the flow ODE into two components: one for clean image estimation and the other for noise estimation. By integrating the likelihood gradient and stochastic noise into each component, respectively, we demonstrate that posterior sampling for inverse problem solving can be effectively achieved using flows. Our proposed solver, Flow-Driven Posterior Sampling (FlowDPS), can also be seamlessly integrated into a latent flow model with a transformer architecture. Across four linear inverse problems, we confirm that FlowDPS outperforms state-of-the-art alternatives, all without requiring additional training.

Title: FilmComposer: LLM-Driven Music Production for Silent Film Clips

Authors: Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.08147
Pdf URL: https://arxiv.org/pdf/2503.08147
Copy Paste: [[2503.08147]] FilmComposer: LLM-Driven Music Production for Silent Film Clips(https://arxiv.org/abs/2503.08147)
Keywords: generative
Abstract: In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film (audio quality, musicality, and musical development), and it introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, considering the lack of a professional, high-quality film music dataset, we propose MusicPro-7k, which includes 7,418 film clips, music, description, rhythm spots and main melody. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: this https URL

Title: Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features

Authors: Hanbyul Lee, Juneho Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08148
Pdf URL: https://arxiv.org/pdf/2503.08148
Copy Paste: [[2503.08148]] Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features(https://arxiv.org/abs/2503.08148)
Keywords: generative
Abstract: Recently, images that distort or fabricate facts using generative models have become a social concern. To cope with the continuous evolution of generative artificial intelligence (AI) models, model attribution (MA) is necessary beyond just detection of synthetic images. However, current deep learning-based MA methods must be trained from scratch with new data to recognize unseen models, which is time-consuming and data-intensive. This work proposes a new strategy to deal with persistently emerging generative models. We adapt few-shot class-incremental learning (FSCIL) mechanisms for the MA problem to uncover novel generative AI models. Unlike existing FSCIL approaches that focus on object classification using high-level information, MA requires analyzing low-level details like color and texture in synthetic images. Thus, we utilize a learnable representation from different levels of CLIP-ViT features. To learn an effective representation, we propose an Adaptive Integration Module (AIM) to calculate a weighted sum of CLIP-ViT block features for each image, enhancing the ability to identify generative models. Extensive experiments show our method effectively extends from prior generative models to recent ones.
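Note: the per-image weighted sum of block features mentioned in the abstract above can be sketched as follows. This is only an illustration, not the AIM module itself: it assumes that some learnable component has already produced one score per block for the image and that the scores are normalized with a softmax, neither of which is stated in the abstract. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-block scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_blocks(
    block_features: Sequence[Sequence[float]],
    scores: Sequence[float],
) -> List[float]:
    """Combine one feature vector per block into a single vector.

    `block_features[b]` is the feature of block b for one image and
    `scores[b]` is that image's score for block b (assumed given).
    """
    weights = softmax(scores)
    dim = len(block_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, block_features))
        for d in range(dim)
    ]
```

Equal scores reduce this to a plain average over blocks, while a dominant score makes the output follow a single block, which is how such a scheme could shift emphasis between low-level and high-level features per image.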

Title: U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers

Authors: Zhanjie Zhang, Ao Ma, Ke Cao, Jing Wang, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08157
Pdf URL: https://arxiv.org/pdf/2503.08157
Copy Paste: [[2503.08157]] U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers(https://arxiv.org/abs/2503.08157)
Keywords: diffusion
Abstract: Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from the style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based style transfer approaches. Although these methods can generate some artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality artistic stylized images. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement, generating ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively solves the problem that the existing style transfer methods cannot produce high-quality artistic stylized images due to the size of the dataset and the quality of the images in the dataset. Finally, the extensive qualitative and quantitative experiments validate that our U-StyDiT can create higher quality stylized images compared to state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.

Title: Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation

Authors: Taojie Kuang, Qianli Ma, Athanasios V. Vasilakos, Yu Wang, Qiang (Shawn) Cheng, Zhixiang Ren
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08160
Pdf URL: https://arxiv.org/pdf/2503.08160
Copy Paste: [[2503.08160]] Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation(https://arxiv.org/abs/2503.08160)
Keywords: diffusion
Abstract: In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug-likeness, and interpretability, while fragment-based approaches frequently overlook comprehensive factors that influence protein-molecule interactions. To address these challenges, we propose a novel fragment-based molecular generation framework tailored for specific proteins. Our method begins by constructing a protein subpocket and molecular arm concept-based neural network, which systematically integrates interaction force information and geometric complementarity to sample molecular arms for specific protein subpockets. Subsequently, we introduce a diffusion model to generate molecular backbones that connect these arms, ensuring structural integrity and chemical diversity. Our approach significantly improves synthetic feasibility and binding affinity, with a 4% increase in drug-likeness and a 6% improvement in synthetic feasibility. Furthermore, by integrating explicit interaction data through a concept-based model, our framework enhances interpretability, offering valuable insights into the molecular design process.

Title: Multimodal Generation of Animatable 3D Human Models with AvatarForge

Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08165
Pdf URL: https://arxiv.org/pdf/2503.08165
Copy Paste: [[2503.08165]] Multimodal Generation of Animatable 3D Human Models with AvatarForge(https://arxiv.org/abs/2503.08165)
Keywords: diffusion
Abstract: We introduce AvatarForge, a framework for generating animatable 3D human avatars from text or image inputs using AI-driven procedural generation. While diffusion-based methods have made strides in general 3D object generation, they struggle with high-quality, customizable human avatars due to the complexity and diversity of human body shapes and poses, exacerbated by the scarcity of high-quality data. Additionally, animating these avatars remains a significant challenge for existing methods. AvatarForge overcomes these limitations by combining LLM-based commonsense reasoning with off-the-shelf 3D human generators, enabling fine-grained control over body and facial details. Unlike diffusion models, which often rely on pre-trained datasets lacking precise control over individual human features, AvatarForge offers a more flexible approach, bringing humans into the iterative design and modeling loop. Its auto-verification system allows for continuous refinement of the generated avatars, thus promoting high accuracy and customization. Our evaluations show that AvatarForge outperforms state-of-the-art methods in both text- and image-to-avatar generation, making it a versatile tool for artistic creation and animation.

Title: TSCnet: A Text-driven Semantic-level Controllable Framework for Customized Low-Light Image Enhancement

Authors: Miao Zhang, Jun Yin, Pengyu Zeng, Yiqing Shen, Shuai Lu, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08168
Pdf URL: https://arxiv.org/pdf/2503.08168
Copy Paste: [[2503.08168]] TSCnet: A Text-driven Semantic-level Controllable Framework for Customized Low-Light Image Enhancement(https://arxiv.org/abs/2503.08168)
Keywords: diffusion
Abstract: Deep learning-based image enhancement methods show significant advantages in reducing noise and improving visibility in low-light conditions. These methods are typically based on one-to-one mapping, where the model learns a direct transformation from low light to specific enhanced images. Therefore, these methods are inflexible as they do not allow highly personalized mapping, even though an individual's lighting preferences are inherently personalized. To overcome these limitations, we propose a new light enhancement task and a new framework that provides customized lighting control through prompt-driven, semantic-level, and quantitative brightness adjustments. The framework begins by leveraging a Large Language Model (LLM) to understand natural language prompts, enabling it to identify target objects for brightness adjustments. To localize these target objects, the Retinex-based Reasoning Segment (RRS) module generates precise target localization masks using reflection images. Subsequently, the Text-based Brightness Controllable (TBC) module adjusts brightness levels based on the generated illumination map. Finally, an Adaptive Contextual Compensation (ACC) module integrates multi-modal inputs and controls a conditional diffusion model to adjust the lighting, ensuring seamless and precise enhancements. Experimental results on benchmark datasets demonstrate our framework's superior performance at increasing visibility, maintaining natural color balance, and amplifying fine details without creating artifacts. Furthermore, its robust generalization capabilities enable complex semantic-level lighting adjustments in diverse open-world environments through natural language interactions.

Title: Towards All-in-One Medical Image Re-Identification

Authors: Yuan Tian, Kaiyuan Ji, Rongzhao Zhang, Yankai Jiang, Chunyi Li, Xiaosong Wang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08173
Pdf URL: https://arxiv.org/pdf/2503.08173
Copy Paste: [[2503.08173]] Towards All-in-One Medical Image Re-Identification(https://arxiv.org/abs/2503.08173)
Keywords: foundation model
Abstract: Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image features, modeling the inter-image difference better fits the re-identification problem, which involves discriminating multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Code and models are available at this https URL.

Title: Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models

Authors: Xuanhan Wang, Huimin Deng, Lianli Gao, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08201
Pdf URL: https://arxiv.org/pdf/2503.08201
Copy Paste: [[2503.08201]] Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models(https://arxiv.org/abs/2503.08201)
Keywords: self-supervised
Abstract: Human-centric visual perception (HVP) has recently achieved remarkable progress due to advancements in large-scale self-supervised pretraining (SSP). However, existing HVP models face limitations in adapting to real-world applications, which require general visual patterns for downstream tasks while maintaining computationally sustainable costs to ensure compatibility with edge devices. These limitations primarily arise from two issues: 1) the pretraining objectives focus solely on specific visual patterns, limiting the generalizability of the learned patterns for diverse downstream tasks; and 2) HVP models often exhibit excessively large model sizes, making them incompatible with real-world applications. To address these limitations, we introduce Scale-Aware Image Pretraining (SAIP), a novel SSP framework enabling lightweight vision models to acquire general patterns for HVP. Specifically, SAIP incorporates three learning objectives based on the principle of cross-scale consistency: 1) Cross-scale Matching (CSM), which contrastively learns image-level invariant patterns from multi-scale single-person images; 2) Cross-scale Reconstruction (CSR), which learns pixel-level consistent visual structures from multi-scale masked single-person images; and 3) Cross-scale Search (CSS), which learns to capture diverse patterns from multi-scale multi-person images. The three objectives complement one another, enabling lightweight models to learn multi-scale generalizable patterns essential for HVP downstream tasks. Experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable generalization capabilities across 9 human-centric vision tasks. Moreover, it achieves significant performance improvements over existing methods, with gains of 3%-13% in single-person discrimination tasks, 1%-11% in dense prediction tasks, and 1%-6% in multi-person visual understanding tasks.

Title: A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning

Authors: Chungpa Lee, Jeongheon Oh, Kibok Lee, Jy-yong Sohn
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08203
Pdf URL: https://arxiv.org/pdf/2503.08203
Copy Paste: [[2503.08203]] A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning(https://arxiv.org/abs/2503.08203)
Keywords: self-supervised
Abstract: Supervised contrastive learning (SupCL) has emerged as a prominent approach in representation learning, leveraging both supervised and self-supervised losses. However, achieving an optimal balance between these losses is challenging; failing to do so can lead to class collapse, reducing discrimination among individual embeddings in the same class. In this paper, we present theoretically grounded guidelines for SupCL to prevent class collapse in learned representations. Specifically, we introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including all embeddings that minimize the supervised contrastive loss. Through SSEM, we analyze how hyperparameters affect learned representations, offering practical guidelines for hyperparameter selection to mitigate the risk of class collapse. Our theoretical findings are supported by empirical results across synthetic and real-world datasets.

Title: S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction

Authors: Guangting Zheng, Jiajun Deng, Xiaomeng Chu, Yu Yuan, Houqiang Li, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08217
Pdf URL: https://arxiv.org/pdf/2503.08217
Copy Paste: [[2503.08217]] S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction(https://arxiv.org/abs/2503.08217)
Keywords: foundation model
Abstract: Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Such designs together make S3R-GS readily adapt to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse2 dataset, it achieves state-of-the-art PSNR and SSIM, reducing reconstruction time to below 50%--and even 20%--of competing methods.

Title: MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior

Authors: Kaiqiang Xiong, Ying Feng, Qi Zhang, Jianbo Jiao, Yang Zhao, Zhihao Liang, Huachen Gao, Ronggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08218
Pdf URL: https://arxiv.org/pdf/2503.08218
Copy Paste: [[2503.08218]] MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior(https://arxiv.org/abs/2503.08218)
Keywords: diffusion
Abstract: 3D human reconstruction from a single image is a challenging problem that has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating one back-view image to facilitate reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving overall fidelity. Leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

Title: Aligning Text to Image in Diffusion Models is Easier Than You Think

Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08250
Pdf URL: https://arxiv.org/pdf/2503.08250
Copy Paste: [[2503.08250]] Aligning Text to Image in Diffusion Models is Easier Than You Think(https://arxiv.org/abs/2503.08250)
Keywords: diffusion, generative
Abstract: While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations remains. Although many approaches have attempted to address this issue, for example by fine-tuning models with various reward models, we revisit the challenge from the perspective of representation alignment, an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, are suboptimal from the standpoint of representation alignment. Instead, better alignment can be achieved through contrastive learning that leverages both positive and negative pairs. To achieve this efficiently even with pretrained models, we introduce a lightweight contrastive fine-tuning strategy called SoftREPA that uses soft text tokens. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.

Title: SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models

Authors: Hesen Chen, Junyan Wang, Zhiyu Tan, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08253
Pdf URL: https://arxiv.org/pdf/2503.08253
Copy Paste: [[2503.08253]] SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models(https://arxiv.org/abs/2503.08253)
Keywords: diffusion
Abstract: Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations or to ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.

Title: DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness

Authors: Yiming Zhong, Qi Jiang, Jingyi Yu, Yuexin Ma
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.08257
Pdf URL: https://arxiv.org/pdf/2503.08257
Copy Paste: [[2503.08257]] DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness(https://arxiv.org/abs/2503.08257)
Keywords: diffusion, generative
Abstract: A dexterous hand capable of grasping any object is essential for the development of general-purpose embodied intelligent robots. However, due to the high degree of freedom in dexterous hands and the vast diversity of objects, generating high-quality, usable grasping poses in a robust manner is a significant challenge. In this paper, we introduce DexGrasp Anything, a method that effectively integrates physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance across nearly all open datasets. Additionally, we present a new dexterous grasping dataset containing over 3.4 million diverse grasping poses for more than 15k different objects, demonstrating its potential to advance universal dexterous grasping. The code of our method and our dataset will be publicly released soon.

Title: PromptLNet: Region-Adaptive Aesthetic Enhancement via Prompt Guidance in Low-Light Enhancement Net

Authors: Jun Yin, Yangfan He, Miao Zhang, Pengyu Zeng, Tianyi Wang, Shuai Lu, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08276
Pdf URL: https://arxiv.org/pdf/2503.08276
Copy Paste: [[2503.08276]] PromptLNet: Region-Adaptive Aesthetic Enhancement via Prompt Guidance in Low-Light Enhancement Net(https://arxiv.org/abs/2503.08276)
Keywords: diffusion
Abstract: Learning and improving large language models through human preference feedback has become a mainstream approach, but it has rarely been applied to the field of low-light image enhancement. Existing low-light enhancement evaluations typically rely on objective metrics (such as FID, PSNR, etc.), which often result in models that perform well objectively but lack aesthetic quality. Moreover, most low-light enhancement models are primarily designed for global brightening, lacking detailed refinement. Therefore, the generated images often require additional local adjustments, leading to research gaps in practical applications. To bridge this gap, we propose the following innovations: 1) We collect human aesthetic evaluation text pairs and aesthetic scores from multiple low-light image datasets (e.g., LOL, LOL2, LOM, DCIM, MEF, etc.) to train a low-light image aesthetic evaluation model, supplemented by an optimization algorithm designed to fine-tune the diffusion model. 2) We propose a prompt-driven brightness adjustment module capable of performing fine-grained brightness and aesthetic adjustments for specific instances or regions. 3) We evaluate our method alongside existing state-of-the-art algorithms on mainstream benchmarks. Experimental results show that our method not only outperforms traditional methods in terms of visual quality but also provides greater flexibility and controllability, paving the way for improved aesthetic quality.

Title: OminiControl2: Efficient Conditioning for Diffusion Transformers

Authors: Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08280
Pdf URL: https://arxiv.org/pdf/2503.08280
Copy Paste: [[2503.08280]] OminiControl2: Efficient Conditioning for Diffusion Transformers(https://arxiv.org/abs/2503.08280)
Keywords: diffusion
Abstract: Fine-grained control of text-to-image diffusion transformer models (DiT) remains a critical challenge for practical deployment. While recent advances such as OminiControl have enabled controllable generation from diverse control signals, these methods face significant computational inefficiency when handling long conditional inputs. We present OminiControl2, a framework for efficient image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9$\times$ speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
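Illustration (not from the paper): the idea of computing condition features once and reusing them across denoising steps can be sketched roughly as below. The function names, the toy encoder, and the toy denoising update are hypothetical stand-ins chosen for this sketch; they are not OminiControl2's actual interfaces or update rule.

```python
from typing import Callable, List

Vector = List[float]


def sample_with_condition_reuse(
    encode_condition: Callable[[Vector], Vector],
    denoise_step: Callable[[Vector, Vector, int], Vector],
    latent: Vector,
    condition: Vector,
    num_steps: int,
) -> Vector:
    """Run num_steps denoising steps, encoding the condition only once."""
    cond_features = encode_condition(condition)  # computed a single time
    for step in range(num_steps):
        # The cached features are reused at every step instead of recomputed.
        latent = denoise_step(latent, cond_features, step)
    return latent


# Toy stand-ins (hypothetical, for illustration only).
calls = {"encode": 0}


def toy_encode(condition: Vector) -> Vector:
    calls["encode"] += 1  # count how often the encoder runs
    return [2.0 * c for c in condition]


def toy_denoise(latent: Vector, cond_features: Vector, step: int) -> Vector:
    # Move each latent value halfway toward the matching condition feature.
    return [0.5 * (x + f) for x, f in zip(latent, cond_features)]


result = sample_with_condition_reuse(
    toy_encode, toy_denoise, [0.0, 0.0], [1.0, 2.0], num_steps=4
)
```

In this toy setup the encoder runs once regardless of num_steps, which is the source of the saving when the condition is long and the number of steps is large.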

Title: A systematic literature review of unsupervised learning algorithms for anomalous traffic detection based on flows

Authors: Alberto Miguel-Diez, Adrián Campazas-Vega, Claudia Álvarez-Aparicio, Gonzalo Esteban-Costales, Ángel Manuel Guerrero-Higueras
Subjects: cs.CR, cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2503.08293
Pdf URL: https://arxiv.org/pdf/2503.08293
Copy Paste: [[2503.08293]] A systematic literature review of unsupervised learning algorithms for anomalous traffic detection based on flows(https://arxiv.org/abs/2503.08293)
Keywords: anomaly
Abstract: The constant increase in devices connected to the Internet, and therefore in cyber-attacks, makes it necessary to analyze network traffic in order to recognize malicious activity. Traditional packet-based analysis methods are insufficient because in large networks the amount of traffic is so high that it is infeasible to review all communications. For this reason, flow-based analysis is a suitable approach, and it will have to be used in future 5G networks, as the number of packets will increase dramatically. When combined with unsupervised learning models, it can also detect new threats for which the models have not been trained. This paper presents a systematic review of the literature on unsupervised learning algorithms for detecting anomalies in network flows, following the PRISMA guideline. A total of 63 scientific articles have been reviewed, 13 of them analyzed in depth. The results show that the autoencoder is the most used option, followed by SVM, ALAD, or SOM. In addition, all the datasets used for anomaly detection have been collected, including some specialised in IoT or with real data collected from honeypots.

Title: D3PO: Preference-Based Alignment of Discrete Diffusion Models

Authors: Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08295
Pdf URL: https://arxiv.org/pdf/2503.08295
Copy Paste: [[2503.08295]] D3PO: Preference-Based Alignment of Discrete Diffusion Models(https://arxiv.org/abs/2503.08295)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D3PO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D3PO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D3PO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D3PO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.

Title: $^R$FLAV: Rolling Flow matching for infinite Audio Video generation

Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08307
Pdf URL: https://arxiv.org/pdf/2503.08307
Copy Paste: [[2503.08307]] $^R$FLAV: Rolling Flow matching for infinite Audio Video generation(https://arxiv.org/abs/2503.08307)
Keywords: generative
Abstract: Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present $^R$FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that $^R$FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at this https URL.

Title: Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework

Authors: Bin Huang, Binzhong He, Yanhan Chen, Zhili Liu, Xinyue Wang, Binxuan Li, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08339
Pdf URL: https://arxiv.org/pdf/2503.08339
Copy Paste: [[2503.08339]] Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework(https://arxiv.org/abs/2503.08339)
Keywords: diffusion
Abstract: Deep learning has significantly advanced PET image reconstruction, achieving remarkable improvements in image quality through direct training on sinogram or image data. Traditional methods often utilize masks for inpainting tasks, but their incorporation into PET reconstruction frameworks introduces transformative potential. In this study, we propose an advanced PET reconstruction framework called Diffusion tRansformer mEets rAndom Masks (DREAM). To the best of our knowledge, this is the first work to integrate mask mechanisms into both the sinogram domain and the latent space, pioneering their role in PET reconstruction and demonstrating their ability to enhance reconstruction fidelity and efficiency. The framework employs a high-dimensional stacking approach, transforming masked data from two to three dimensions to expand the solution space and enable the model to capture richer spatial relationships. Additionally, a mask-driven latent space is designed to accelerate the diffusion process by leveraging sinogram-driven and mask-driven compact priors, which reduce computational complexity while preserving essential data characteristics. A hierarchical masking strategy is also introduced, guiding the model from focusing on fine-grained local details in the early stages to capturing broader global patterns over time. This progressive approach ensures a balance between detailed feature preservation and comprehensive context understanding. Experimental results demonstrate that DREAM not only improves the overall quality of reconstructed PET images but also preserves critical clinical details, highlighting its potential to advance PET imaging technology. By integrating compact priors and hierarchical masking, DREAM offers a promising and efficient avenue for future research and application in PET imaging. The open-source code is available at: this https URL.

Title: Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis

Authors: Chanyoung Kim, Dayun Ju, Jinyeong Kim, Woojung Han, Roberto Alcover-Couso, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08346
Pdf URL: https://arxiv.org/pdf/2503.08346
Copy Paste: [[2503.08346]] Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis(https://arxiv.org/abs/2503.08346)
Keywords: diffusion
Abstract: As recent text-conditioned diffusion models have enabled the generation of high-quality images, concerns over their potential misuse have also grown. This issue is critical in the medical domain, where text-conditioned generated medical images could enable insurance fraud or falsified records, highlighting the urgent need for reliable safeguards against unethical use. While watermarking techniques have emerged as a promising solution in general image domains, their direct application to medical imaging presents significant challenges. A key challenge is preserving fine-grained disease manifestations, as even minor distortions from a watermark may lead to clinical misinterpretation, which compromises diagnostic integrity. To overcome this gap, we present MedSign, a deep learning-based watermarking framework specifically designed for text-to-medical image synthesis, which preserves pathologically significant regions by adaptively adjusting watermark strength. Specifically, we generate a pathology localization map using cross-attention between medical text tokens and the diffusion denoising network, aggregating token-wise attention across layers, heads, and time steps. Leveraging this map, we optimize the LDM decoder to incorporate watermarking during image synthesis, ensuring cohesive integration while minimizing interference in diagnostically critical regions. Experimental results show that our MedSign preserves diagnostic integrity while ensuring watermark robustness, achieving state-of-the-art performance in image quality and detection accuracy on MIMIC-CXR and OIA-ODIR datasets.

Title: Robust Latent Matters: Boosting Image Generation with Sampling Error

Authors: Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08354
Pdf URL: https://arxiv.org/pdf/2503.08354
Copy Paste: [[2503.08354]] Robust Latent Matters: Boosting Image Generation with Sampling Error(https://arxiv.org/abs/2503.08354)
Keywords: generative
Abstract: Recent image generation schemes typically capture the image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g., rFID) fail to precisely assess the tokenizer and correlate its performance with the generation quality (e.g., gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation quality in a discrete latent space and, based on this analysis, propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noise, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance with generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a $\sim$400M generator. Code: this https URL.

Title: nnInteractive: Redefining 3D Promptable Segmentation

Authors: Fabian Isensee, Maximilian Rokuss, Lars Krämer, Stefan Dinkelacker, Ashis Ravindran, Florian Stritzke, Benjamin Hamm, Tassilo Wald, Moritz Langenberg, Constantin Ulrich, Jonathan Deissler, Ralf Floca, Klaus Maier-Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08373
Pdf URL: https://arxiv.org/pdf/2503.08373
Copy Paste: [[2503.08373]] nnInteractive: Redefining 3D Promptable Segmentation(https://arxiv.org/abs/2503.08373)
Keywords: foundation model
Abstract: Accurate and efficient 3D segmentation is essential for both clinical and research applications. While foundation models like SAM have revolutionized interactive segmentation, their 2D design and domain shift limitations make them ill-suited for 3D medical images. Current adaptations address some of these challenges but remain limited, either lacking volumetric awareness, offering restricted interactivity, or supporting only a small set of structures and modalities. Usability also remains a challenge, as current tools are rarely integrated into established imaging platforms and often rely on cumbersome web-based interfaces with restricted functionality. We introduce nnInteractive, the first comprehensive 3D interactive open-set segmentation method. It supports diverse prompts (including points, scribbles, boxes, and a novel lasso prompt) while leveraging intuitive 2D interactions to generate full 3D segmentations. Trained on 120+ diverse volumetric 3D datasets (CT, MRI, PET, 3D Microscopy, etc.), nnInteractive sets a new state-of-the-art in accuracy, adaptability, and usability. Crucially, it is the first method integrated into widely used image viewers (e.g., Napari, MITK), ensuring broad accessibility for real-world clinical and research applications. Extensive benchmarking demonstrates that nnInteractive far surpasses existing methods, setting a new standard for AI-driven interactive 3D segmentation. nnInteractive is publicly available: this https URL (Napari plugin), this https URL (MITK integration), this https URL (Python backend).

Title: Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens

Authors: Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08377
Pdf URL: https://arxiv.org/pdf/2503.08377
Copy Paste: [[2503.08377]] Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens(https://arxiv.org/abs/2503.08377)
Keywords: diffusion
Abstract: Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose the Latent Consistency Tokenizer (Layton), which bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens, a 16-fold compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of an LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to a latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with a reconstruction Fréchet Inception Distance of 10.8 on the MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, which operates autoregressively. It achieves a score of 0.73 on the GenEval benchmark, surpassing current state-of-the-art methods. The code and model will be released.
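Illustration (not from the paper): the 16-fold compression figure is consistent with a simple token count, under the assumption that the VQGAN baseline downsamples each spatial side by a factor of 16. That factor is an assumption made for this sketch; the abstract does not state it.

```python
def token_count(image_size: int, downsample_factor: int) -> int:
    """Number of tokens for a square image tokenized on a regular grid."""
    side = image_size // downsample_factor
    return side * side


# Assumption: the VQGAN baseline uses a 16x spatial downsampling factor.
vqgan_tokens = token_count(1024, 16)

# From the abstract: Layton represents a 1024x1024 image with 256 tokens.
layton_tokens = 256

compression = vqgan_tokens / layton_tokens
print(vqgan_tokens, layton_tokens, compression)
```

Under that assumption the baseline needs a 64 x 64 grid of tokens, and the ratio to 256 tokens matches the compression factor quoted in the abstract.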

Title: Recognition-Synergistic Scene Text Editing

Authors: Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08387
Pdf URL: https://arxiv.org/pdf/2503.08387
Copy Paste: [[2503.08387]] Recognition-Synergistic Scene Text Editing(https://arxiv.org/abs/2503.08387)
Keywords: self-supervised
Abstract: Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.

Title: DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank

Authors: Zhanjie Zhang, Quanwei Zhang, Guangyuan Li, Junsheng Luan, Mengyuan Yang, Yun Wang, Lei Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08392
Pdf URL: https://arxiv.org/pdf/2503.08392
Copy Paste: [[2503.08392]] DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank(https://arxiv.org/abs/2503.08392)
Keywords: diffusion
Abstract: Artistic style transfer aims to transfer the learned style onto an arbitrary content image. However, most existing style transfer methods can only render consistent artistic stylized images, making it difficult for users to get enough stylized images to enjoy. To solve this issue, we propose a novel artistic style transfer framework called DyArtbank, which can generate diverse and highly realistic artistic stylized images. Specifically, we introduce a Dynamic Style Prompt ArtBank (DSPA), a set of learnable parameters. It can learn and store the style information from the collection of artworks, dynamically guiding pre-trained stable diffusion to generate diverse and highly realistic artistic stylized images. DSPA can also generate random artistic image samples with the learned style information, providing a new idea for data augmentation. Besides, a Key Content Feature Prompt (KCFP) module is proposed to provide sufficient content prompts for pre-trained stable diffusion to preserve the detailed structure of the input content image. Extensive qualitative and quantitative experiments verify the effectiveness of our proposed method. Code is available: this https URL

Title: OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning

Authors: Jiawei Zhou, Lei Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08398
Pdf URL: https://arxiv.org/pdf/2503.08398
Copy Paste: [[2503.08398]] OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning(https://arxiv.org/abs/2503.08398)
Keywords: in-context
Abstract: In this paper, we analyze and empirically show that the learned relevance for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance, enabling adaptation to the diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever, consistently outperforming existing state-of-the-art retrievers by 2.1%. Additionally, our results indicate that for some tasks, an end-to-end tuned 0.2B retriever can achieve improvements that surpass those of RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of our approach in enhancing RAG systems.

Title: Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information

Authors: Elizaveta Kuznetsova, Ilaria Vitulano, Mykola Makhortykh, Martha Stolze, Tomas Nagy, Victoria Vziatysheva
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.08404
Pdf URL: https://arxiv.org/pdf/2503.08404
Copy Paste: [[2503.08404]] Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information(https://arxiv.org/abs/2503.08404)
Keywords: generative
Abstract: The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use an AI auditing methodology that systematically evaluates the performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g., topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges in using LLMs for fact-checking, including significant variation in performance across different LLMs and unequal quality of outputs for specific topics, which can be attributed to deficits of training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting potential avenues for further improvements in guardrails as well as fine-tuning.

Title: Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing

Authors: Chen Liao, Yan Shen, Dan Li, Zhongli Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08429
Pdf URL: https://arxiv.org/pdf/2503.08429
Copy Paste: [[2503.08429]] Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing(https://arxiv.org/abs/2503.08429)
Keywords: diffusion
Abstract: Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but they require a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of a pre-trained diffusion model in DUNs to achieve high-quality reconstruction with fewer steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration process of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires as few as 2 steps to reconstruct the image. Codes are available at this https URL.

Title: Controlling Latent Diffusion Using Latent CLIP

Authors: Jason Becker, Chris Wendler, Peter Baylies, Robert West, Christian Wressnegger
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.08455
Pdf URL: https://arxiv.org/pdf/2503.08455
Copy Paste: [[2503.08455]] Controlling Latent Diffusion Using Latent CLIP(https://arxiv.org/abs/2503.08455)
Keywords: diffusion
Abstract: Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in the latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational costs. However, while the diffusion process has moved to the latent space, the contrastive language-image pre-training (CLIP) models, as used in many image processing tasks, still operate in pixel space. Doing so requires costly VAE-decoding of latent images before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches the zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and an LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the total pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.

Title: NullFace: Training-Free Localized Face Anonymization

Authors: Han-Wei Kung, Tuomas Varanka, Terence Sim, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08478
Pdf URL: https://arxiv.org/pdf/2503.08478
Copy Paste: [[2503.08478]] NullFace: Training-Free Localized Face Anonymization(https://arxiv.org/abs/2503.08478)
Keywords: diffusion
Abstract: Privacy concerns around the ever-increasing number of cameras are growing in today's digital age. Although existing anonymization methods are able to obscure identity information, they often struggle to preserve the utility of the images. In this work, we introduce a training-free method for face anonymization that preserves key non-identity-related attributes. Our approach utilizes a pre-trained text-to-image diffusion model without requiring optimization or training. It begins by inverting the input image to recover its initial noise. The noise is then denoised through an identity-conditioned diffusion process, where modified identity embeddings ensure the anonymized face is distinct from the original identity. Our approach also supports localized anonymization, giving users control over which facial regions are anonymized or kept intact. Comprehensive evaluations against state-of-the-art methods show our approach excels in anonymization, attribute preservation, and image quality. Its flexibility, robustness, and practicality make it well-suited for real-world applications. Code and data can be found at this https URL.

Title: Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Authors: Shengpeng Xiao, Yuanfang Guo, Heqi Peng, Zeming Liu, Liang Yang, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08484
Pdf URL: https://arxiv.org/pdf/2503.08484
Copy Paste: [[2503.08484]] Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum(https://arxiv.org/abs/2503.08484)
Keywords: diffusion, generative
Abstract: The generalization performance of AI-generated image detection remains a critical challenge. Although most existing methods perform well in detecting images from generative models included in the training set, their accuracy drops significantly when faced with images from unseen generators. To address this limitation, we propose a novel detection method based on the fractal self-similarity of the spectrum, a common feature among images generated by different models. Specifically, we demonstrate that AI-generated images exhibit fractal-like spectral growth through periodic extension and low-pass filtering. This observation motivates us to exploit the similarity among different fractal branches of the spectrum. Instead of directly analyzing the spectrum, our method mitigates the impact of varying spectral characteristics across different generators, improving detection performance for images from unseen models. Experiments on a public benchmark demonstrated the generalized detection performance across both GANs and diffusion models.

Title: TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting

Authors: Fengyi Zhang, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08485
Pdf URL: https://arxiv.org/pdf/2503.08485
Copy Paste: [[2503.08485]] TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting(https://arxiv.org/abs/2503.08485)
Keywords: self-supervised, foundation model
Abstract: Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense voxel decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and such models often fail to adapt to varying voxel resolutions or new classes without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-GaussOcc. Our approach incrementally optimizes time-aware 3D Gaussians instantiated from raw sensor streams at runtime, enabling voxelization at arbitrary user-specified resolution. Specifically, TT-GaussOcc operates in a "lift-move-voxel" symphony: we first "lift" surrounding-view semantics obtained from 2D vision foundation models (VLMs) to instantiate Gaussians at non-empty 3D space; next, we "move" dynamic Gaussians from previous frames along estimated Gaussian scene flow to complete appearance and eliminate trailing artifacts of fast-moving objects, while accumulating static Gaussians to enforce temporal consistency; finally, we mitigate inherent noise in semantic predictions and scene flow vectors by periodically smoothing neighboring Gaussians during optimization, using proposed trilateral RBF kernels that jointly consider color, semantic, and spatial affinities. The historical static and current dynamic Gaussians are then combined and voxelized to generate occupancy prediction. Extensive experiments on Occ3D and nuCraft with varying voxel resolutions demonstrate that TT-GaussOcc surpasses self-supervised baselines by 46% on mIoU without any offline training, and supports finer voxel resolutions at 2.6 FPS inference speed.

Title: Learning to Match Unpaired Data with Minimum Entropy Coupling

Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08501
Pdf URL: https://arxiv.org/pdf/2503.08501
Copy Paste: [[2503.08501]] Learning to Match Unpaired Data with Minimum Entropy Coupling(https://arxiv.org/abs/2503.08501)
Keywords: diffusion, generative
Abstract: Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which poses a significant challenge for learning a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint entropy while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application for cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.

Title: DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning

Authors: Sergey Alyaev, Kristian Fossum, Hibat Errahmen Djecta, Jan Tveranger, Ahmed H. Elsheikh
Subjects: cs.LG, math.OC, physics.geo-ph, stat.AP
Abstract URL: https://arxiv.org/abs/2503.08509
Pdf URL: https://arxiv.org/pdf/2503.08509
Copy Paste: [[2503.08509]] DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning(https://arxiv.org/abs/2503.08509)
Keywords: generative
Abstract: The real-time process of directional changes while drilling, known as geosteering, is crucial for hydrocarbon extraction and emerging directional drilling applications such as geothermal energy, civil infrastructure, and CO2 storage. The geo-energy industry seeks an automatic geosteering workflow that continually updates the subsurface uncertainties and captures the latest geological understanding given the most recent observations in real-time. We propose "DISTINGUISH": a real-time, AI-driven workflow designed to transform geosteering by integrating Generative Adversarial Networks (GANs) for geological parameterization, ensemble methods for model updating, and global discrete dynamic programming (DDP) optimization for complex decision-making during directional drilling operations. The DISTINGUISH framework relies on offline training of a GAN model to reproduce relevant geology realizations and a Forward Neural Network (FNN) to model Logging-While-Drilling (LWD) tools' response for a given geomodel. This paper introduces a first-of-its-kind workflow that progressively reduces GAN-geomodel uncertainty around and ahead of the drilling bit and adjusts the well plan accordingly. The workflow automatically integrates real-time LWD data with a DDP-based decision support system, enhancing predictive models of geology ahead of drilling and leading to better steering decisions. We present a simple yet representative benchmark case and document the performance target achieved by the DISTINGUISH workflow prototype. This benchmark will be a foundation for future methodological advancements and workflow refinements.

Title: SAS: Segment Any 3D Scene with Integrated 2D Priors

Authors: Zhuoyuan Li, Jiahao Lu, Jiacheng Deng, Hanzhi Chang, Lifan Wu, Yanzhe Liang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08512
Pdf URL: https://arxiv.org/pdf/2503.08512
Copy Paste: [[2503.08512]] SAS: Segment Any 3D Scene with Integrated 2D Priors(https://arxiv.org/abs/2503.08512)
Keywords: diffusion
Abstract: The open vocabulary capability of 3D models is increasingly valued, as traditional methods with models trained with fixed categories fail to recognize unseen objects in complex dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify the 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.

Title: High-Quality 3D Head Reconstruction from Any Single Portrait Image

Authors: Jianfu Zhang, Yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08516
Pdf URL: https://arxiv.org/pdf/2503.08516
Copy Paste: [[2503.08516]] High-Quality 3D Head Reconstruction from Any Single Portrait Image(https://arxiv.org/abs/2503.08516)
Keywords: diffusion, generative
Abstract: In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision to extract accurate facial representations, which guide the model and enforce objective functions to ensure high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories.

Title: SignRep: Enhancing Self-Supervised Sign Representations

Authors: Ryan Wong, Necati Cihan Camgoz, Richard Bowden
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08529
Pdf URL: https://arxiv.org/pdf/2503.08529
Copy Paste: [[2503.08529]] SignRep: Enhancing Self-Supervised Sign Representations(https://arxiv.org/abs/2503.08529)
Keywords: self-supervised
Abstract: Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. Existing methods often rely either on models pre-trained on general visual tasks that lack sign-specific features, or on complex multimodal and multi-branch architectures. To bridge this gap, we introduce a scalable, self-supervised framework for sign representation learning. We leverage important inductive (sign) priors during the training of our RGB model. To do this, we leverage simple but important cues based on skeletons while pretraining a masked autoencoder. These sign-specific priors, alongside feature regularization and an adversarial style-agnostic loss, provide a powerful backbone. Notably, our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models during downstream tasks. When finetuned, we achieve state-of-the-art performance for sign recognition on the WLASL, ASL-Citizen and NMFs-CSL datasets, using a simpler architecture and only a single modality. Beyond recognition, our frozen model excels in sign dictionary retrieval and sign translation, surpassing standard MAE pretraining and skeletal-based representations in retrieval. It also reduces computational costs for training existing sign translation models while maintaining strong performance on Phoenix2014T, CSL-Daily and How2Sign.

Title: ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems

Authors: Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.08533
Pdf URL: https://arxiv.org/pdf/2503.08533
Copy Paste: [[2503.08533]] ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems(https://arxiv.org/abs/2503.08533)
Keywords: foundation model
Abstract: Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: this https URL.

Title: Modular Customization of Diffusion Models via Blockwise-Parameterized Low-Rank Adaptation

Authors: Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08575
Pdf URL: https://arxiv.org/pdf/2503.08575
Copy Paste: [[2503.08575]] Modular Customization of Diffusion Models via Blockwise-Parameterized Low-Rank Adaptation(https://arxiv.org/abs/2503.08575)
Keywords: diffusion
Abstract: Recent diffusion model customization has shown impressive results in incorporating subject or style concepts with a handful of images. However, the modular composition of multiple concepts into a customized model, aimed to efficiently merge decentralized-trained concepts without influencing their identities, remains unresolved. Modular customization is essential for applications like concept stylization and multi-concept customization using concepts trained by different users. Existing post-training methods are only confined to a fixed set of concepts, and any different combinations require a new round of retraining. In contrast, instant merging methods often cause identity loss and interference of individual merged concepts and are usually limited to a small number of concepts. To address these issues, we propose BlockLoRA, an instant merging method designed to efficiently combine multiple concepts while accurately preserving individual concepts' identity. With a careful analysis of the underlying reason for interference, we develop the Randomized Output Erasure technique to minimize the interference of different customized models. Additionally, Blockwise LoRA Parameterization is proposed to reduce the identity loss during instant model merging. Extensive experiments validate the effectiveness of BlockLoRA, which can instantly merge 15 concepts of people, subjects, scenes, and styles with high fidelity.

Title: 3D Point Cloud Generation via Autoregressive Up-sampling

Authors: Ziqiao Meng, Qichao Wang, Zhipeng Zhou, Irwin King, Peilin Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08594
Pdf URL: https://arxiv.org/pdf/2503.08594
Copy Paste: [[2503.08594]] 3D Point Cloud Generation via Autoregressive Up-sampling(https://arxiv.org/abs/2503.08594)
Keywords: diffusion, generative
Abstract: We introduce a pioneering autoregressive generative model for 3D point cloud generation. Inspired by visual autoregressive modeling (VAR), we conceptualize point cloud generation as an autoregressive up-sampling process. This leads to our novel model, PointARU, which progressively refines 3D point clouds from coarse to fine scales. PointARU follows a two-stage training paradigm: first, it learns multi-scale discrete representations of point clouds, and then it trains an autoregressive transformer for next-scale prediction. To address the inherent unordered and irregular structure of point clouds, we incorporate specialized point-based up-sampling network modules in both stages and integrate 3D absolute positional encoding based on the decoded point cloud at each scale during the second stage. Our model surpasses state-of-the-art (SoTA) diffusion-based approaches in both generation quality and parameter efficiency across diverse experimental settings, marking a new milestone for autoregressive methods in 3D point cloud generation. Furthermore, PointARU demonstrates exceptional performance in completing partial 3D shapes and up-sampling sparse point clouds, outperforming existing generative models in these tasks.

Title: Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

Authors: Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, Jinwoo Shin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08605
Pdf URL: https://arxiv.org/pdf/2503.08605
Copy Paste: [[2503.08605]] Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling(https://arxiv.org/abs/2503.08605)
Keywords: diffusion
Abstract: While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To tackle such an issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these samplings misaligns denoising trajectories, disrupting prompt guidance and introducing unintended content changes as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.

Title: Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention

Authors: Emily Xiao, Chin-Jou Li, Yilin Zhang, Graham Neubig, Amanda Bertsch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08640
Pdf URL: https://arxiv.org/pdf/2503.08640
Copy Paste: [[2503.08640]] Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention(https://arxiv.org/abs/2503.08640)
Keywords: in-context
Abstract: Many-shot in-context learning has recently shown promise as an alternative to finetuning, with the major advantage that the same model can be served for multiple tasks. However, this shifts the computational burden from training time to inference time, making deployment of many-shot ICL challenging to justify in practice. This cost is further increased if a custom demonstration set is retrieved for each inference example. We present Dynamic Block-Sparse Attention, a training-free framework for retrieval-based many-shot in-context learning. By combining carefully designed block-sparse attention and retrieval of cached groups of demonstrations, we achieve comparable per-example latency to finetuning while maintaining on average >95% of the best method's accuracy across strong ICL and finetuning baselines. We hope that this will further enable the deployment of many-shot ICL at scale.

Title: Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Authors: Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08644
Pdf URL: https://arxiv.org/pdf/2503.08644
Copy Paste: [[2503.08644]] Exploiting Instruction-Following Retrievers for Malicious Information Retrieval(https://arxiv.org/abs/2503.08644)
Keywords: in-context
Abstract: Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval-augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.

Title: MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input

Authors: Zhenchen Wan, Yanwu Xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08650
Pdf URL: https://arxiv.org/pdf/2503.08650
Copy Paste: [[2503.08650]] MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input(https://arxiv.org/abs/2503.08650)
Keywords: diffusion
Abstract: Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig.1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance regarding garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark and demonstrating a substantial lead over previous approaches. For more details, visit our project page: this https URL.

Title: MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Authors: Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08664
Pdf URL: https://arxiv.org/pdf/2503.08664
Copy Paste: [[2503.08664]] MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention(https://arxiv.org/abs/2503.08664)
Keywords: diffusion
Abstract: Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.

Title: REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

Authors: Yitian Zhang, Long Mai, Aniruddha Mahapatra, David Bourgin, Yicong Hong, Jonah Casebeer, Feng Liu, Yun Fu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08665
Pdf URL: https://arxiv.org/pdf/2503.08665
Copy Paste: [[2503.08665]] REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder(https://arxiv.org/abs/2503.08665)
Keywords: diffusion, generative
Abstract: We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
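To make the compression figures in this abstract concrete, the sketch below works through the arithmetic. It assumes that "8x higher" means the leading embedders use a temporal ratio of 32 / 8 = 4, and that the number of latent frames is simply the clip length divided by the ratio, rounded up. The function `latent_frames` and the 128-frame clip are hypothetical illustrations, not part of the paper; real embedders may treat the first frame separately.

```python
import math


def latent_frames(num_frames: int, temporal_ratio: int) -> int:
    """Latent frames needed for a clip, assuming plain ceil division (an assumption)."""
    if num_frames <= 0 or temporal_ratio <= 0:
        raise ValueError("num_frames and temporal_ratio must be positive")
    return math.ceil(num_frames / temporal_ratio)


regen_ratio = 32                   # temporal compression ratio stated in the abstract
baseline_ratio = regen_ratio // 8  # implied by "8x higher than leading video embedders"
clip_length = 128                  # hypothetical clip length in frames

compact = latent_frames(clip_length, regen_ratio)
baseline = latent_frames(clip_length, baseline_ratio)
print(f"{clip_length} frames -> {compact} latent frames at {regen_ratio}x, "
      f"{baseline} latent frames at {baseline_ratio}x")
```

Under these assumptions, a downstream latent diffusion model would see a temporal sequence one eighth as long, which is where the claimed training and inference savings would come from.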

Title: Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields

Authors: Tobias Kreiman, Aditi S. Krishnapriyan
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.08674
Pdf URL: https://arxiv.org/pdf/2503.08674
Copy Paste: [[2503.08674]] Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields(https://arxiv.org/abs/2503.08674)
Keywords: foundation model
Abstract: Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. In order to characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. We then propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. Our second strategy improves representations for out-of-distribution systems at test-time by taking gradient steps using an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of, and can move towards, modeling diverse chemical spaces but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at this https URL.

Title: Language-Depth Navigated Thermal and Visible Image Fusion

Authors: Jinchang Zhang, Zijun Li, Guoyu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08676
Pdf URL: https://arxiv.org/pdf/2503.08676
Copy Paste: [[2503.08676]] Language-Depth Navigated Thermal and Visible Image Fusion(https://arxiv.org/abs/2503.08676)
Keywords: diffusion
Abstract: Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.

Title: OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Authors: Yongsheng Yu, Ziyun Zeng, Haitian Zheng, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08677
Pdf URL: https://arxiv.org/pdf/2503.08677
Copy Paste: [[2503.08677]] OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting(https://arxiv.org/abs/2503.08677)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: this https URL

Title: "Principal Components" Enable A New Language of Images

Authors: Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08685
Pdf URL: https://arxiv.org/pdf/2503.08685
Copy Paste: [[2503.08685]] "Principal Components" Enable A New Language of Images(https://arxiv.org/abs/2503.08685)
Keywords: diffusion
Abstract: We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space -- a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
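The PCA analogy in this abstract can be illustrated with classical PCA itself. The sketch below is not the paper's tokenizer: it only shows, on hypothetical toy data, the property the abstract refers to, namely that principal components are ordered by decreasing explained variance, so each additional component adds a smaller share of information. The function name `explained_variance_ratio` and the data are assumptions made for illustration.

```python
import numpy as np


def explained_variance_ratio(x: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component of x."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Singular values are returned in descending order, so the
    # per-component variances are already sorted from largest to smallest.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples with 6 features of different scales.
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
data = rng.normal(size=(200, 6)) * scales

ratios = explained_variance_ratio(data)
cumulative = np.cumsum(ratios)

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {k}: explained {r:.3f}, cumulative {c:.3f}")
```

Reading the printed table from top to bottom mirrors the coarse-to-fine ordering described in the abstract: the first few components account for most of the variance, and later ones contribute progressively less.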