2025-03-13

Title: Versatile Multimodal Controls for Whole-Body Talking Human Animation

Authors: Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Zixin Zhu, Minghui Yang, Ming Yang, Le Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08714
Pdf URL: https://arxiv.org/pdf/2503.08714
Copy Paste: [[2503.08714]] Versatile Multimodal Controls for Whole-Body Talking Human Animation(https://arxiv.org/abs/2503.08714)
Keywords: diffusion
Abstract: Human animation from a single reference image shall be flexible to synthesize whole-body motion for either a headshot or whole-body portrait, where the motions are readily controlled by audio signal and text prompts. This is hard for most existing methods as they only support producing pre-specified head or half-body motion aligned with audio inputs. In this paper, we propose a versatile human animation method, i.e., VersaAnimator, which generates whole-body talking human from arbitrary portrait images, not only driven by audio signal but also flexibly controlled by text prompts. Specifically, we design a text-controlled, audio-driven motion generator that produces whole-body motion representations in 3D synchronized with audio inputs while following textual motion descriptions. To promote natural smooth motion, we propose a code-pose translation module to link VAE codebooks with 2D DWposes extracted from template videos. Moreover, we introduce a multi-modal video diffusion that generates photorealistic human animation from a reference image according to both audio inputs and whole-body motion representations. Extensive experiments show that VersaAnimator outperforms existing methods in visual quality, identity preservation, and audio-lip synchronization.

Title: A Recipe for Improving Remote Sensing VLM Zero Shot Generalization

Authors: Aviad Barzilai, Yotam Gigi, Vered Silverman, Yehonathan Refael, Bolous Jaber, Amr Helmy, Tomer Shekel, George Leifman, Genady Beryozkin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08722
Pdf URL: https://arxiv.org/pdf/2503.08722
Copy Paste: [[2503.08722]] A Recipe for Improving Remote Sensing VLM Zero Shot Generalization(https://arxiv.org/abs/2503.08722)
Keywords: foundation model
Abstract: Foundation models have had a significant impact across various AI applications, enabling use cases that were previously impossible. Contrastive Visual Language Models (VLMs), in particular, have outperformed other techniques in many tasks. However, their prevalence in remote sensing (RS) is still limited, due to the scarcity of diverse remote-sensing visual-language datasets. In this work we introduce two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain, resulting in a diverse dataset with greater breadth in image styles and subject matter. These datasets are used to pre-train the MaMMUT~\citep{kuo2023mammutsimplearchitecturejoint} VLM architecture, resulting in state-of-the-art generalization performance in zero-shot cross-modal retrieval on well-known public benchmarks. Finally, we present our ongoing research to distill image-level knowledge gained in the VLM contrastive training procedure to enhance the model's localization ability. Specifically, we iteratively generate pseudo-labels for image regions based on the model's attention maps and use these labels for further training. To mitigate noisy attention maps and create robust segmentation masks, we introduce a novel attention-pooling mechanism called the Smooth-Attention-Operation.

Title: Training Plug-n-Play Knowledge Modules with Deep Context Distillation

Authors: Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08727
Pdf URL: https://arxiv.org/pdf/2503.08727
Copy Paste: [[2503.08727]] Training Plug-n-Play Knowledge Modules with Deep Context Distillation(https://arxiv.org/abs/2503.08727)
Keywords: in-context
Abstract: Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KMs parameters such as to simulate hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques, across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.

Title: Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models

Authors: Ishaan Malhi, Praneet Dutta, Ellie Talius, Sally Ma, Brendan Driscoll, Krista Holden, Garima Pruthi, Arunachalam Narayanaswamy
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08729
Pdf URL: https://arxiv.org/pdf/2503.08729
Copy Paste: [[2503.08729]] Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models(https://arxiv.org/abs/2503.08729)
Keywords: diffusion
Abstract: We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline. This pipeline leverages image-to-video diffusion, in/outpainting & negatives to create synthetic training data, addressing limitations of real-world data collection for this task. Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model's understanding of product characteristics. Evaluation on the ABO dataset and a private product dataset, using automated metrics and human assessment, demonstrates the effectiveness of our framework in generating realistic and compelling product visualizations, with implications for applications such as e-commerce and virtual product showcasing.

Title: Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models

Authors: In Cho, Youngbeom Yoo, Subin Jeon, Seon Joo Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08737
Pdf URL: https://arxiv.org/pdf/2503.08737
Copy Paste: [[2503.08737]] Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models(https://arxiv.org/abs/2503.08737)
Keywords: diffusion
Abstract: Constructing a compressed latent space through a variational autoencoder (VAE) is the key for efficient 3D diffusion models. This paper introduces COD-VAE, a VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing computational overhead of neural fields decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation.

Title: Seal Your Backdoor with Variational Defense

Authors: Ivan Sabolić, Matej Grcić, Siniša Šegvić
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.08829
Pdf URL: https://arxiv.org/pdf/2503.08829
Copy Paste: [[2503.08829]] Seal Your Backdoor with Variational Defense(https://arxiv.org/abs/2503.08829)
Keywords: self-supervised
Abstract: We propose VIBE, a model-agnostic framework that trains classifiers resilient to backdoor attacks. The key concept behind our approach is to treat malicious inputs and corrupted labels from the training dataset as observed random variables, while the actual clean labels are latent. VIBE then recovers the corresponding latent clean label posterior through variational inference. The resulting training procedure follows the expectation-maximization (EM) algorithm. The E-step infers the clean pseudolabels by solving an entropy-regularized optimal transport problem, while the M-step updates the classifier parameters via gradient descent. Being modular, VIBE can seamlessly integrate with recent advancements in self-supervised representation learning, which enhance its ability to resist backdoor attacks. We experimentally validate the method effectiveness against contemporary backdoor attacks on standard datasets, a large-scale setup with 1$k$ classes, and a dataset poisoned with multiple attacks. VIBE consistently outperforms previous defenses across all tested scenarios.

Title: Contrastive Speaker-Aware Learning for Multi-party Dialogue Generation with LLMs

Authors: Tianyu Sun, Kun Qian, Wenhong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08842
Pdf URL: https://arxiv.org/pdf/2503.08842
Copy Paste: [[2503.08842]] Contrastive Speaker-Aware Learning for Multi-party Dialogue Generation with LLMs(https://arxiv.org/abs/2503.08842)
Keywords: generative
Abstract: Multi-party dialogue generation presents significant challenges due to the complex interplay of multiple speakers and interwoven conversational threads. Traditional approaches often fall short in capturing these complexities, particularly when relying on manually annotated dialogue relations. This paper introduces Speaker-Attentive LLM (SA-LLM), a novel generative model that leverages pre-trained Large Language Models (LLMs) and a speaker-aware contrastive learning strategy to address these challenges. SA-LLM incorporates a speaker-attributed input encoding and a contrastive learning objective to implicitly learn contextual coherence and speaker roles without explicit relation annotations. Extensive experiments on the Ubuntu IRC and Movie Dialogues datasets demonstrate that SA-LLM significantly outperforms state-of-the-art baselines in automatic and human evaluations, achieving superior performance in fluency, coherence, informativeness, and response diversity. Ablation studies and detailed error analyses further validate the effectiveness of the proposed speaker-attentive training approach, highlighting its robustness across different speaker roles and context lengths. The results underscore the potential of SA-LLM as a powerful and annotation-free solution for high-quality multi-party dialogue generation.

Title: Interpretable and Robust Dialogue State Tracking via Natural Language Summarization with LLMs

Authors: Rafael Carranza, Mateo Alejandro Rojas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08857
Pdf URL: https://arxiv.org/pdf/2503.08857
Copy Paste: [[2503.08857]] Interpretable and Robust Dialogue State Tracking via Natural Language Summarization with LLMs(https://arxiv.org/abs/2503.08857)
Keywords: generative
Abstract: This paper introduces a novel approach to Dialogue State Tracking (DST) that leverages Large Language Models (LLMs) to generate natural language descriptions of dialogue states, moving beyond traditional slot-value representations. Conventional DST methods struggle with open-domain dialogues and noisy inputs. Motivated by the generative capabilities of LLMs, our Natural Language DST (NL-DST) framework trains an LLM to directly synthesize human-readable state descriptions. We demonstrate through extensive experiments on MultiWOZ 2.1 and Taskmaster-1 datasets that NL-DST significantly outperforms rule-based and discriminative BERT-based DST baselines, as well as generative slot-filling GPT-2 DST models, in both Joint Goal Accuracy and Slot Accuracy. Ablation studies and human evaluations further validate the effectiveness of natural language state generation, highlighting its robustness to noise and enhanced interpretability. Our findings suggest that NL-DST offers a more flexible, accurate, and human-understandable approach to dialogue state tracking, paving the way for more robust and adaptable task-oriented dialogue systems.

Title: Multilevel Generative Samplers for Investigating Critical Phenomena

Authors: Ankur Singha, Elia Cellini, Kim A. Nicoli, Karl Jansen, Stefan Kühn, Shinichi Nakajima
Subjects: cs.LG, hep-lat, stat.ML
Abstract URL: https://arxiv.org/abs/2503.08918
Pdf URL: https://arxiv.org/pdf/2503.08918
Copy Paste: [[2503.08918]] Multilevel Generative Samplers for Investigating Critical Phenomena(https://arxiv.org/abs/2503.08918)
Keywords: generative
Abstract: Investigating critical phenomena or phase transitions is of high interest in physics and chemistry, for which Monte Carlo (MC) simulations, a crucial tool for numerically analyzing macroscopic properties of given systems, are often hindered by an emerging divergence of correlation length -- known as scale invariance at criticality (SIC) in the renormalization group theory. SIC causes the system to behave the same at any length scale, from which many existing sampling methods suffer: long-range correlations cause critical slowing down in Markov chain Monte Carlo (MCMC), and require intractably large receptive fields for generative samplers. In this paper, we propose a Renormalization-informed Generative Critical Sampler (RiGCS) -- a novel sampler specialized for near-critical systems, where SIC is leveraged as an advantage rather than a nuisance. Specifically, RiGCS builds on MultiLevel Monte Carlo (MLMC) with Heat Bath (HB) algorithms, which perform ancestral sampling from low-resolution to high-resolution lattice configurations with site-wise-independent conditional HB sampling. Although MLMC-HB is highly efficient under exact SIC, it suffers from a low acceptance rate under slight SIC violation. Notably, SIC violation always occurs in finite-size systems, and may induce long-range and higher-order interactions in the renormalized distributions, which are not considered by independent HB samplers. RiGCS enhances MLMC-HB by replacing a part of the conditional HB sampler with generative models that capture those residual interactions and improve the sampling efficiency. Our experiments show that the effective sample size of RiGCS is a few orders of magnitude higher than state-of-the-art generative model baselines in sampling configurations for 128x128 two-dimensional Ising systems.

Title: Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Authors: Zilong Deng, Simon Khan, Shaofeng Zou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08934
Pdf URL: https://arxiv.org/pdf/2503.08934
Copy Paste: [[2503.08934]] Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model(https://arxiv.org/abs/2503.08934)
Keywords: generative
Abstract: In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, named Iterated CVaR. %We consider the sample complexity of obtaining an $\epsilon$-optimal policy in an infinite horizon discounted MDP, given access to a generative model. % We first build a connection between Iterated CVaR RL with $(s, a)$-rectangular distributional robust RL with the specific uncertainty set for CVaR. We develop nearly matching upper and lower bounds on the sample complexity for this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, if $\tau \geq \gamma$, then the sample complexity can be further improved to $\tilde{O}\left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show a minimax lower bound of ${\tilde{O}}\left(\frac{(1-\gamma \tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. For a constant risk level $0<\tau\leq 1$, our upper and lower bounds match with each other, demonstrating the tightness and optimality of our analyses. We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O}\left(\frac{SA}{p_{\min}}\right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.

Title: I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

Authors: Yuhang Liu, Dong Gong, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.08980
Pdf URL: https://arxiv.org/pdf/2503.08980
Copy Paste: [[2503.08980]] I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?(https://arxiv.org/abs/2503.08980)
Keywords: generative
Abstract: The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This is as opposed to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also strongly reinforces the linear representation hypothesis, which posits that LLMs learn linear representations of human-interpretable concepts. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.

Title: Image Encryption Using DNA Encoding, Snake Permutation and Chaotic Substitution Techniques

Authors: Waleed Ahmed Farooqui, Jawad Ahmad, Nadeem Kureshi, Fawad Ahmed, Aizaz Ahmad Khattak, Muhammad Shahbaz Khan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.09038
Pdf URL: https://arxiv.org/pdf/2503.09038
Copy Paste: [[2503.09038]] Image Encryption Using DNA Encoding, Snake Permutation and Chaotic Substitution Techniques(https://arxiv.org/abs/2503.09038)
Keywords: diffusion
Abstract: Securing image data in IoT networks and other insecure information channels is a matter of critical concern. This paper presents a new image encryption scheme using DNA encoding, snake permutation and chaotic substitution techniques that ensures robust security of the image data with reduced computational overhead. The DNA encoding and snake permutation modules ensure effective scrambling of the pixels and result in efficient diffusion in the plaintext image. For the confusion part, the chaotic substitution technique is implemented, which substitutes the pixel values chosen randomly from 3 S-boxes. Extensive security analysis validate the efficacy of the image encryption algorithm proposed in this paper and results demonstrate that the encrypted images have an ideal information entropy of 7.9895 and an almost zero correlation coefficient of -0.001660. These results indicate a high degree of randomness and no correlation in the encrypted image.

Title: Implicit Contrastive Representation Learning with Guided Stop-gradient

Authors: Byeongchan Lee, Sehyun Lee
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.09058
Pdf URL: https://arxiv.org/pdf/2503.09058
Copy Paste: [[2503.09058]] Implicit Contrastive Representation Learning with Guided Stop-gradient(https://arxiv.org/abs/2503.09058)
Keywords: self-supervised
Abstract: In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. But it is prone to collapse into a degenerate solution. To address the issue, in contrastive learning, a contrastive loss is used to prevent collapse by moving representations of negative pairs away from each other. But it is known that algorithms with negative sampling are not robust to a reduction in the number of negative samples. So, on the other hand, there are algorithms that do not use negative pairs. Many positive-only algorithms adopt asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting the asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method guided stop-gradient. We apply our method to benchmark algorithms SimSiam and BYOL and show that our method stabilizes training and boosts performance. We also show that the algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at this https URL.

Title: Theoretical Guarantees for High Order Trajectory Refinement in Generative Flows

Authors: Chengyue Gong, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yu Tian
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09069
Pdf URL: https://arxiv.org/pdf/2503.09069
Copy Paste: [[2503.09069]] Theoretical Guarantees for High Order Trajectory Refinement in Generative Flows(https://arxiv.org/abs/2503.09069)
Keywords: diffusion, generative
Abstract: Flow matching has emerged as a powerful framework for generative modeling, offering computational advantages over diffusion models by leveraging deterministic Ordinary Differential Equations (ODEs) instead of stochastic dynamics. While prior work established the worst case optimality of standard flow matching under Wasserstein distances, the theoretical guarantees for higher-order flow matching - which incorporates acceleration terms to refine sample trajectories - remain unexplored. In this paper, we bridge this gap by proving that higher-order flow matching preserves worst case optimality as a distribution estimator. We derive upper bounds on the estimation error for second-order flow matching, demonstrating that the convergence rates depend polynomially on the smoothness of the target distribution (quantified via Besov spaces) and key parameters of the ODE dynamics. Our analysis employs neural network approximations with carefully controlled depth, width, and sparsity to bound acceleration errors across both small and large time intervals, ultimately unifying these results into a general worst case optimal bound for all time steps.

Title: Multi-Modal Foundation Models for Computational Pathology: A Survey

Authors: Dong Li, Guihong Wan, Xintao Wu, Xinyu Wu, Xiaohui Chen, Yi He, Christine G. Lian, Peter K. Sorger, Yevgeniy R. Semenov, Chen Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09091
Pdf URL: https://arxiv.org/pdf/2503.09091
Copy Paste: [[2503.09091]] Multi-Modal Foundation Models for Computational Pathology: A Survey(https://arxiv.org/abs/2503.09091)
Keywords: foundation model
Abstract: Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.

Title: Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation

Authors: Zihao Chen, Hisashi Handa, Miho Ohsaki, Kimiaki Shirahama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.09094
Pdf URL: https://arxiv.org/pdf/2503.09094
Copy Paste: [[2503.09094]] Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation(https://arxiv.org/abs/2503.09094)
Keywords: self-supervised
Abstract: Several backbone models pre-trained on general domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation that adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning) that utilizes a data generator to generate sentences, which have the same syntactic structure to a sentence in an unlabeled specific domain corpus but convey different semantic meanings. Generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC like a backbone model and a method to adapt it need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validates the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, codes and backbone models adapted by SDJC are available on our github repository this https URL.

Title: Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge

Authors: Maximilian Abstreiter, Sasu Tarkoma, Roberto Morabito
Subjects: cs.LG, cs.AI, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2503.09114
Pdf URL: https://arxiv.org/pdf/2503.09114
Copy Paste: [[2503.09114]] Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge(https://arxiv.org/abs/2503.09114)
Keywords: generative
Abstract: The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models-typically under 10 billion parameters-enabled by techniques such as quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators-including memory usage, inference speed, and energy consumption-across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.

Title: Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?

Authors: Yuechen Xie, Jie Song, Huiqiong Wang, Mingli Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09122
Pdf URL: https://arxiv.org/pdf/2503.09122
Copy Paste: [[2503.09122]] Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?(https://arxiv.org/abs/2503.09122)
Keywords: diffusion, generative
Abstract: High-quality open-source text-to-image models have lowered the threshold for obtaining photorealistic images significantly, but also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, when lacking real data resources especially. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method to this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization ability will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-$\alpha$, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99\% in determining the provenance of suspicious model training data, surpassing all previous methods. Code is available at this https URL.

Title: AdvAD: Exploring Non-Parametric Diffusion for Imperceptible Adversarial Attacks

Authors: Jin Li, Ziqiang He, Anwei Luo, Jian-Fang Hu, Z. Jane Wang, Xiangui Kang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.09124
Pdf URL: https://arxiv.org/pdf/2503.09124
Copy Paste: [[2503.09124]] AdvAD: Exploring Non-Parametric Diffusion for Imperceptible Adversarial Attacks(https://arxiv.org/abs/2503.09124)
Keywords: diffusion, generative
Abstract: Imperceptible adversarial attacks aim to fool DNNs by adding imperceptible perturbation to the input data. Previous methods typically improve the imperceptibility of attacks by integrating common attack paradigms with specifically designed perception-based losses or the capabilities of generative models. In this paper, we propose Adversarial Attacks in Diffusion (AdvAD), a novel modeling framework distinct from existing attack paradigms. AdvAD innovatively conceptualizes attacking as a non-parametric diffusion process by theoretically exploring basic modeling approach rather than using the denoising or generation abilities of regular diffusion models requiring neural networks. At each step, much subtler yet effective adversarial guidance is crafted using only the attacked model without any additional network, which gradually leads the end of diffusion process from the original image to a desired imperceptible adversarial example. Grounded in a solid theoretical foundation of the proposed non-parametric diffusion process, AdvAD achieves high attack efficacy and imperceptibility with intrinsically lower overall perturbation strength. Additionally, an enhanced version AdvAD-X is proposed to evaluate the extreme of our novel framework under an ideal scenario. Extensive experiments demonstrate the effectiveness of the proposed AdvAD and AdvAD-X. Compared with state-of-the-art imperceptible attacks, AdvAD achieves an average of 99.9$\%$ (+17.3$\%$) ASR with 1.34 (-0.97) $l_2$ distance, 49.74 (+4.76) PSNR and 0.9971 (+0.0043) SSIM against four prevalent DNNs with three different architectures on the ImageNet-compatible dataset. Code is available at this https URL.

Title: Generative Frame Sampler for Long Video Understanding

Authors: Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, Junnan Li
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.09146
Pdf URL: https://arxiv.org/pdf/2503.09146
Copy Paste: [[2503.09146]] Generative Frame Sampler for Long Video Understanding(https://arxiv.org/abs/2503.09146)
Keywords: generative
Abstract: Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient lengthy video perception. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve impressive state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo surpassing the Gemini-1.5-pro by 1.9 points. We will release all datasets and models at this https URL.

Title: Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Authors: Hyeonho Jeong, Suhyeon Lee, Jong Chul Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09151
Pdf URL: https://arxiv.org/pdf/2503.09151
Copy Paste: [[2503.09151]] Reangle-A-Video: 4D Video Generation as Video-to-Video Translation(https://arxiv.org/abs/2503.09151)
Keywords: diffusion, self-supervised
Abstract: We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: this https URL

Title: WonderVerse: Extendable 3D Scene Generation with Video Generative Models

Authors: Hao Feng, Zhi Zuo, Jia-hui Pan, Ka-hei Hui, Yi-hua Shao, Qi Dou, Wei Xie, Zheng-zhe Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09160
Pdf URL: https://arxiv.org/pdf/2503.09160
Copy Paste: [[2503.09160]] WonderVerse: Extendable 3D Scene Generation with Video Generative Models(https://arxiv.org/abs/2503.09160)
Keywords: foundation model, generative
Abstract: We introduce \textit{WonderVerse}, a simple but effective framework for generating extendable 3D scenes. Unlike existing methods that rely on iterative depth estimation and image inpainting, often leading to geometric distortions and inconsistencies, WonderVerse leverages the powerful world-level priors embedded within video generative foundation models to create highly immersive and geometrically coherent 3D environments. Furthermore, we propose a new technique for controllable 3D scene extension to substantially increase the scale of the generated environments. Besides, we introduce a novel abnormal sequence detection module that utilizes camera trajectory to address geometric inconsistency in the generated videos. Finally, WonderVerse is compatible with various 3D reconstruction methods, allowing both efficient and high-quality generation. Extensive experiments on 3D scene generation demonstrate that our WonderVerse, with an elegant and simple pipeline, delivers extendable and highly-realistic 3D scenes, markedly outperforming existing works that rely on more complex architectures.

Title: Incomplete Multi-view Clustering via Diffusion Contrastive Generation

Authors: Yuanyang Zhang, Yijie Lin, Weiqing Yan, Li Yao, Xinhang Wan, Guangyuan Li, Chao Zhang, Guanzhou Ke, Jie Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09185
Pdf URL: https://arxiv.org/pdf/2503.09185
Copy Paste: [[2503.09185]] Incomplete Multi-view Clustering via Diffusion Contrastive Generation(https://arxiv.org/abs/2503.09185)
Keywords: diffusion
Abstract: Incomplete multi-view clustering (IMVC) has garnered increasing attention in recent years due to the common issue of missing data in multi-view datasets. The primary approach to address this challenge involves recovering the missing views before applying conventional multi-view clustering methods. Although imputation-based IMVC methods have achieved significant improvements, they still encounter notable limitations: 1) heavy reliance on paired data for training the data recovery module, which is impractical in real scenarios with high missing data rates; 2) the generated data often lacks diversity and discriminability, resulting in suboptimal clustering results. To address these shortcomings, we propose a novel IMVC method called Diffusion Contrastive Generation (DCG). Motivated by the consistency between the diffusion and clustering processes, DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data. By performing contrastive learning on a limited set of paired multi-view samples, DCG can align the generated views with the real views, facilitating accurate recovery of views across arbitrary missing view scenarios. Additionally, DCG integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data, achieving robust and end-to-end clustering. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches.

Title: Time-EAPCR: A Deep Learning-Based Novel Approach for Anomaly Detection Applied to the Environmental Field

Authors: Lei Liu, Yuchao Lu, Ling An, Huajie Liang, Chichun Zhou, Zhenyu Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09200
Pdf URL: https://arxiv.org/pdf/2503.09200
Copy Paste: [[2503.09200]] Time-EAPCR: A Deep Learning-Based Novel Approach for Anomaly Detection Applied to the Environmental Field(https://arxiv.org/abs/2503.09200)
Keywords: anomaly
Abstract: As human activities intensify, environmental systems such as aquatic ecosystems and water treatment systems face increasingly complex pressures, impacting ecological balance, public health, and sustainable development, making intelligent anomaly monitoring essential. However, traditional monitoring methods suffer from delayed responses, insufficient data processing capabilities, and weak generalisation, making them unsuitable for complex environmental monitoring this http URL recent years, machine learning has been widely applied to anomaly detection, but the multi-dimensional features and spatiotemporal dynamics of environmental ecological data, especially the long-term dependencies and strong variability in the time dimension, limit the effectiveness of traditional this http URL learning, with its ability to automatically learn features, captures complex nonlinear relationships, improving detection performance. However, its application in environmental monitoring is still in its early stages and requires further this http URL paper introduces a new deep learning method, Time-EAPCR (Time-Embedding-Attention-Permutated CNN-Residual), and applies it to environmental science. The method uncovers feature correlations, captures temporal evolution patterns, and enables precise anomaly detection in environmental this http URL validated Time-EAPCR's high accuracy and robustness across four publicly available environmental datasets. Experimental results show that the method efficiently handles multi-source data, improves detection accuracy, and excels across various scenarios with strong adaptability and generalisation. Additionally, a real-world river monitoring dataset confirmed the feasibility of its deployment, providing reliable technical support for environmental monitoring.

Title: Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latant Space

Authors: Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Fu Liu, Peng Jia, Xianpeng Lang, Xiaolong Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09215
Pdf URL: https://arxiv.org/pdf/2503.09215
Copy Paste: [[2503.09215]] Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latant Space(https://arxiv.org/abs/2503.09215)
Keywords: diffusion
Abstract: Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan ego vehicle's trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the end-to-end autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In addition, it remains a challenge to match multiple trajectories with each vehicle in the video to control the video generation. To address above issues, a driving \textbf{W}orld \textbf{M}odel named EOT-WM is proposed in this paper, unifying \textbf{E}go-\textbf{O}ther vehicle \textbf{T}rajectories in videos. Specifically, we first project ego and other vehicle trajectories in the BEV space into the image coordinate to match each trajectory with its corresponding vehicle in the video. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30\% in FID and 55\% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.

Title: N2C2: Nearest Neighbor Enhanced Confidence Calibration for Cross-Lingual In-Context Learning

Authors: Jie He, Simon Yu, Deyi Xiong, Víctor Gutiérrez-Basulto, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.09218
Pdf URL: https://arxiv.org/pdf/2503.09218
Copy Paste: [[2503.09218]] N2C2: Nearest Neighbor Enhanced Confidence Calibration for Cross-Lingual In-Context Learning(https://arxiv.org/abs/2503.09218)
Keywords: in-context
Abstract: Recent advancements of in-context learning (ICL) show language models can significantly improve their performance when demonstrations are provided. However, little attention has been paid to model calibration and prediction confidence of ICL in cross-lingual scenarios. To bridge this gap, we conduct a thorough analysis of ICL for cross-lingual sentiment classification. Our findings suggest that ICL performs poorly in cross-lingual scenarios, exhibiting low accuracy and presenting high calibration errors. In response, we propose a novel approach, N2C2, which employs a -nearest neighbors augmented classifier for prediction confidence calibration. N2C2 narrows the prediction gap by leveraging a datastore of cached few-shot instances. Specifically, N2C2 integrates the predictions from the datastore and incorporates confidence-aware distribution, semantically consistent retrieval representation, and adaptive neighbor combination modules to effectively utilize the limited number of supporting instances. Evaluation on two multilingual sentiment classification datasets demonstrates that N2C2 outperforms traditional ICL. It surpasses fine tuning, prompt tuning and recent state-of-the-art methods in terms of accuracy and calibration errors.

Title: Active Learning Inspired ControlNet Guidance for Augmenting Semantic Segmentation Datasets

Authors: Hannah Kniesel, Pedro Hermosilla, Timo Ropinski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09221
Pdf URL: https://arxiv.org/pdf/2503.09221
Copy Paste: [[2503.09221]] Active Learning Inspired ControlNet Guidance for Augmenting Semantic Segmentation Datasets(https://arxiv.org/abs/2503.09221)
Keywords: diffusion
Abstract: Recent advances in conditional image generation from diffusion models have shown great potential in achieving impressive image quality while preserving the constraints introduced by the user. In particular, ControlNet enables precise alignment between ground truth segmentation masks and the generated image content, allowing the enhancement of training datasets in segmentation tasks. This raises a key question: Can ControlNet additionally be guided to generate the most informative synthetic samples for a specific task? Inspired by active learning, where the most informative real-world samples are selected based on sample difficulty or model uncertainty, we propose the first approach to integrate active learning-based selection metrics into the backward diffusion process for sample generation. Specifically, we explore uncertainty, query by committee, and expected model change, which are commonly used in active learning, and demonstrate their application for guiding the sample generation process through gradient approximation. Our method is training-free, modifying only the backward diffusion process, allowing it to be used on any pretrained ControlNet. Using this process, we show that segmentation models trained with guided synthetic data outperform those trained on non-guided synthetic data. Our work underscores the need for advanced control mechanisms for diffusion-based models, which are not only aligned with image content but additionally downstream task performance, highlighting the true potential of synthetic data generation.

Title: NAMI: Efficient Image Generation via Progressive Rectified Flow Transformers

Authors: Yuhang Ma, Bo Cheng, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09242
Pdf URL: https://arxiv.org/pdf/2503.09242
Copy Paste: [[2503.09242]] NAMI: Efficient Image Generation via Progressive Rectified Flow Transformers(https://arxiv.org/abs/2503.09242)
Keywords: diffusion
Abstract: Flow-based transformer models for image generation have achieved state-of-the-art performance with larger model parameters, but their inference deployment cost remains high. To enhance inference performance while maintaining generation quality, we propose progressive rectified flow transformers. We divide the rectified flow into different stages according to resolution, using fewer transformer layers at the low-resolution stages to generate image layouts and concept contours, and progressively adding more layers as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce progressive rectified flow transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 40% to generate a 1024 resolution image; (3) We propose NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and prevent data leakage from open-source benchmarks. The results show that our model is competitive with state-of-the-art models.

Title: UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

Authors: Haoxuan Wang, Jinlong Peng, Qingdong He, Hao Yang, Ying Jin, Jiafu Wu, Xiaobin Hu, Yanjie Pan, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09277
Pdf URL: https://arxiv.org/pdf/2503.09277
Copy Paste: [[2503.09277]] UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer(https://arxiv.org/abs/2503.09277)
Keywords: diffusion, generative
Abstract: With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance.

Title: Unmask It! AI-Generated Product Review Detection in Dravidian Languages

Authors: Somsubhra De, Advait Vats
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09289
Pdf URL: https://arxiv.org/pdf/2503.09289
Copy Paste: [[2503.09289]] Unmask It! AI-Generated Product Review Detection in Dravidian Languages(https://arxiv.org/abs/2503.09289)
Keywords: generative
Abstract: The rise of Generative AI has led to a surge in AI-generated reviews, often posing a serious threat to the credibility of online platforms. Reviews serve as the primary source of information about products and services. Authentic reviews play a vital role in consumer decision-making. The presence of fabricated content misleads consumers, undermines trust and facilitates potential fraud in digital marketplaces. This study focuses on detecting AI-generated product reviews in Tamil and Malayalam, two low-resource languages where research in this domain is relatively under-explored. We worked on a range of approaches - from traditional machine learning methods to advanced transformer-based models such as Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa and MalayalamBERT. Our findings highlight the effectiveness of leveraging the state-of-the-art transformers in accurately identifying AI-generated content, demonstrating the potential in enhancing the detection of fake reviews in low-resource language settings.

Title: Detecting and Preventing Data Poisoning Attacks on AI Models

Authors: Halima I. Kure, Pradipta Sarkar, Ahmed B. Ndanusa, Augustine O. Nwajana
Subjects: cs.CR, eess.IV
Abstract URL: https://arxiv.org/abs/2503.09302
Pdf URL: https://arxiv.org/pdf/2503.09302
Copy Paste: [[2503.09302]] Detecting and Preventing Data Poisoning Attacks on AI Models(https://arxiv.org/abs/2503.09302)
Keywords: anomaly
Abstract: This paper investigates the critical issue of data poisoning attacks on AI models, a growing concern in the ever-evolving landscape of artificial intelligence and cybersecurity. As advanced technology systems become increasingly prevalent across various sectors, the need for robust defence mechanisms against adversarial attacks becomes paramount. The study aims to develop and evaluate novel techniques for detecting and preventing data poisoning attacks, focusing on both theoretical frameworks and practical applications. Through a comprehensive literature review, experimental validation using the CIFAR-10 and Insurance Claims datasets, and the development of innovative algorithms, this paper seeks to enhance the resilience of AI models against malicious data manipulation. The study explores various methods, including anomaly detection, robust optimization strategies, and ensemble learning, to identify and mitigate the effects of poisoned data during model training. Experimental results indicate that data poisoning significantly degrades model performance, reducing classification accuracy by up to 27% in image recognition tasks (CIFAR-10) and 22% in fraud detection models (Insurance Claims dataset). The proposed defence mechanisms, including statistical anomaly detection and adversarial training, successfully mitigated poisoning effects, improving model robustness and restoring accuracy levels by an average of 15-20%. The findings further demonstrate that ensemble learning techniques provide an additional layer of resilience, reducing false positives and false negatives caused by adversarial data injections.

Title: Revealing the Implicit Noise-based Imprint of Generative Models

Authors: Xinghan Li, Jingjing Chen, Yue Yu, Xue Song, Haijun Shan, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09314
Pdf URL: https://arxiv.org/pdf/2503.09314
Copy Paste: [[2503.09314]] Revealing the Implicit Noise-based Imprint of Generative Models(https://arxiv.org/abs/2503.09314)
Keywords: generative
Abstract: With the rapid advancement of vision generation models, the potential security risks stemming from synthetic visual content have garnered increasing attention, posing significant challenges for AI-generated image detection. Existing methods suffer from inadequate generalization capabilities, resulting in unsatisfactory performance on emerging generative models. To address this issue, this paper presents a novel framework that leverages noise-based model-specific imprint for the detection task. Specifically, we propose a novel noise-based imprint simulator to capture intrinsic patterns imprinted in images generated by different models. By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data, thereby enhancing generalization and robustness. Furthermore, we design a new pipeline that pioneers the use of noise patterns, derived from a noise-based imprint extractor, alongside other visual features for AI-generated image detection, resulting in a significant improvement in performance. Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.

Title: Unified Dense Prediction of Video Diffusion

Authors: Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09344
Pdf URL: https://arxiv.org/pdf/2503.09344
Copy Paste: [[2503.09344]] Unified Dense Prediction of Video Diffusion(https://arxiv.org/abs/2503.09344)
Keywords: diffusion
Abstract: We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves video generation's consistency and motion smoothness without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset~\datasetname, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, or depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state-of-the-art in terms of video quality, consistency, and motion smoothness.

Title: Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X

Authors: Katharina Prasse, Marcel Kleinmann, Inken Adam, Kerstin Beckersjuergen, Andreas Edte, Jona Frroku, Timotheus Gumpp, Steffen Jung, Isaac Bravo, Stefanie Walter, Margret Keuper
Subjects: cs.CV, cs.SI
Abstract URL: https://arxiv.org/abs/2503.09361
Pdf URL: https://arxiv.org/pdf/2503.09361
Copy Paste: [[2503.09361]] Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X(https://arxiv.org/abs/2503.09361)
Keywords: foundation model
Abstract: Climate change is one of the most pressing challenges of the 21st century, sparking widespread discourse across social media platforms. Activists, policymakers, and researchers seek to understand public sentiment and narratives while access to social media data has become increasingly restricted in the post-API era. In this study, we analyze a dataset of climate change-related tweets from X (formerly Twitter) shared in 2019, containing 730k tweets along with the shared images. Our approach integrates statistical analysis, image classification, object detection, and sentiment analysis to explore visual narratives in climate discourse. Additionally, we introduce a graphical user interface (GUI) to facilitate interactive data exploration. Our findings reveal key themes in climate communication, highlight sentiment divergence between images and text, and underscore the strengths and limitations of foundation models in analyzing social media imagery. By releasing our code and tools, we aim to support future research on the intersection of climate change, social media, and computer vision.

Title: Towards Graph Foundation Models: A Transferability Perspective

Authors: Yuxiang Wang, Wenqi Fan, Suhang Wang, Yao Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09363
Pdf URL: https://arxiv.org/pdf/2503.09363
Copy Paste: [[2503.09363]] Towards Graph Foundation Models: A Transferability Perspective(https://arxiv.org/abs/2503.09363)
Keywords: foundation model
Abstract: In recent years, Graph Foundation Models (GFMs) have gained significant attention for their potential to generalize across diverse graph domains and tasks. Some works focus on Domain-Specific GFMs, which are designed to address a variety of tasks within a specific domain, while others aim to create General-Purpose GFMs that extend the capabilities of domain-specific models to multiple domains. Regardless of the type, transferability is crucial for applying GFMs across different domains and tasks. However, achieving strong transferability is a major challenge due to the structural, feature, and distributional variations in graph data. To date, there has been no systematic research examining and analyzing GFMs from the perspective of transferability. To bridge the gap, we present the first comprehensive taxonomy that categorizes and analyzes existing GFMs through the lens of transferability, structuring GFMs around their application scope (domain-specific vs. general-purpose) and their approaches to knowledge acquisition and transfer. We provide a structured perspective on current progress and identify potential pathways for advancing GFM generalization across diverse graph datasets and tasks. We aims to shed light on the current landscape of GFMs and inspire future research directions in GFM development.

Title: PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling

Authors: Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.09368
Pdf URL: https://arxiv.org/pdf/2503.09368
Copy Paste: [[2503.09368]] PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling(https://arxiv.org/abs/2503.09368)
Keywords: diffusion
Abstract: We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at this https URL.

Title: Close-up-GS: Enhancing Close-Up View Synthesis in 3D Gaussian Splatting with Progressive Self-Training

Authors: Jiatong Xia, Lingqiao Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09396
Pdf URL: https://arxiv.org/pdf/2503.09396
Copy Paste: [[2503.09396]] Close-up-GS: Enhancing Close-Up View Synthesis in 3D Gaussian Splatting with Progressive Self-Training(https://arxiv.org/abs/2503.09396)
Keywords: generative
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive performance in synthesizing novel views after training on a given set of viewpoints. However, its rendering quality deteriorates when the synthesized view deviates significantly from the training views. This decline occurs due to (1) the model's difficulty in generalizing to out-of-distribution scenarios and (2) challenges in interpolating fine details caused by substantial resolution changes and occlusions. A notable case of this limitation is close-up view generation--producing views that are significantly closer to the object than those in the training set. To tackle this issue, we propose a novel approach for close-up view generation based by progressively training the 3DGS model with self-generated data. Our solution is based on three key ideas. First, we leverage the See3D model, a recently introduced 3D-aware generative model, to enhance the details of rendered views. Second, we propose a strategy to progressively expand the ``trust regions'' of the 3DGS model and update a set of reference views for See3D. Finally, we introduce a fine-tuning strategy to carefully update the 3DGS model with training data generated from the above schemes. We further define metrics for close-up views evaluation to facilitate better research on this problem. By conducting evaluations on specifically selected scenarios for close-up views, our proposed approach demonstrates a clear advantage over competitive solutions.

Title: ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation

Authors: Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09399
Pdf URL: https://arxiv.org/pdf/2503.09399
Copy Paste: [[2503.09399]] ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation(https://arxiv.org/abs/2503.09399)
Keywords: foundation model
Abstract: Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at this https URL.

Title: VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Authors: Kevin Qinghong Lin, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09402
Pdf URL: https://arxiv.org/pdf/2503.09402
Copy Paste: [[2503.09402]] VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary(https://arxiv.org/abs/2503.09402)
Keywords: generative
Abstract: Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that define video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog feature three key innovations: (i) A generative retrieval model, marrying language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Codes are released at this https URL.

Title: Diff-CL: A Novel Cross Pseudo-Supervision Method for Semi-supervised Medical Image Segmentation

Authors: Xiuzhen Guo, Lianyuan Yu, Ji Shi, Na Lei, Hongxiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09408
Pdf URL: https://arxiv.org/pdf/2503.09408
Copy Paste: [[2503.09408]] Diff-CL: A Novel Cross Pseudo-Supervision Method for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2503.09408)
Keywords: diffusion, generative
Abstract: Semi-supervised learning utilizes insights from unlabeled data to improve model generalization, thereby reducing reliance on large labeled datasets. Most existing studies focus on limited samples and fail to capture the overall data distribution. We contend that combining distributional information with detailed information is crucial for achieving more robust and accurate segmentation results. On the one hand, with its robust generative capabilities, diffusion models (DM) learn data distribution effectively. However, it struggles with fine detail capture, leading to generated images with misleading details. Combining DM with convolutional neural networks (CNNs) enables the former to learn data distribution while the latter corrects fine details. While capturing complete high-frequency details by CNNs requires substantial computational resources and is susceptible to local noise. On the other hand, given that both labeled and unlabeled data come from the same distribution, we believe that regions in unlabeled data similar to overall class semantics to labeled data are likely to belong to the same class, while regions with minimal similarity are less likely to. This work introduces a semi-supervised medical image segmentation framework from the distribution perspective (Diff-CL). Firstly, we propose a cross-pseudo-supervision learning mechanism between diffusion and convolution segmentation networks. Secondly, we design a high-frequency mamba module to capture boundary and detail information globally. Finally, we apply contrastive learning for label propagation from labeled to unlabeled data. Our method achieves state-of-the-art (SOTA) performance across three datasets, including left atrium, brain tumor, and NIH pancreas datasets.

Title: Monte Carlo Diffusion for Generalizable Learning-Based RANSAC

Authors: Jiale Wang, Chen Zhao, Wei Ke, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09410
Pdf URL: https://arxiv.org/pdf/2503.09410
Copy Paste: [[2503.09410]] Monte Carlo Diffusion for Generalizable Learning-Based RANSAC(https://arxiv.org/abs/2503.09410)
Keywords: diffusion
Abstract: Random Sample Consensus (RANSAC) is a fundamental approach for robustly estimating parametric models from noisy data. Existing learning-based RANSAC methods utilize deep learning to enhance the robustness of RANSAC against outliers. However, these approaches are trained and tested on the data generated by the same algorithms, leading to limited generalization to out-of-distribution data during inference. Therefore, in this paper, we introduce a novel diffusion-based paradigm that progressively injects noise into ground-truth data, simulating the noisy conditions for training learning-based RANSAC. To enhance data diversity, we incorporate Monte Carlo sampling into the diffusion paradigm, approximating diverse data distributions by introducing different types of randomness at multiple stages. We evaluate our approach in the context of feature matching through comprehensive experiments on the ScanNet and MegaDepth datasets. The experimental results demonstrate that our Monte Carlo diffusion mechanism significantly improves the generalization ability of learning-based RANSAC. We also develop extensive ablation studies that highlight the effectiveness of key components in our framework.

Title: Alias-Free Latent Diffusion Models:Improving Fractional Shift Equivariance of Diffusion Latent Space

Authors: Yifan Zhou, Zeqi Xiao, Shuai Yang, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09419
Pdf URL: https://arxiv.org/pdf/2503.09419
Copy Paste: [[2503.09419]] Alias-Free Latent Diffusion Models:Improving Fractional Shift Equivariance of Diffusion Latent Space(https://arxiv.org/abs/2503.09419)
Keywords: diffusion
Abstract: Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation. Code is available at: this https URL

Title: Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation

Authors: Yaorui Shi, Jiaqi Yang, Sihang Li, Junfeng Fang, Xiang Wang, Zhiyuan Liu, Yang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09427
Pdf URL: https://arxiv.org/pdf/2503.09427
Copy Paste: [[2503.09427]] Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation(https://arxiv.org/abs/2503.09427)
Keywords: generative
Abstract: Pre-trained language models (PLMs) have revolutionized scientific research, yet their application to single-cell analysis remains limited. Text PLMs cannot process single-cell RNA sequencing data, while cell PLMs lack the ability to handle free text, restricting their use in multimodal tasks. Existing efforts to bridge these modalities often suffer from information loss or inadequate single-modal pre-training, leading to suboptimal performances. To address these challenges, we propose Single-Cell MultiModal Generative Pre-trained Transformer (scMMGPT), a unified PLM for joint cell and text modeling. scMMGPT effectively integrates the state-of-the-art cell and text PLMs, facilitating cross-modal knowledge sharing for improved performance. To bridge the text-cell modality gap, scMMGPT leverages dedicated cross-modal projectors, and undergoes extensive pre-training on 27 million cells -- the largest dataset for multimodal cell-text PLMs to date. This large-scale pre-training enables scMMGPT to excel in joint cell-text tasks, achieving an 84\% relative improvement of textual discrepancy for cell description generation, 20.5\% higher accuracy for cell type annotation, and 4\% improvement in $k$-NN accuracy for text-conditioned pseudo-cell generation, outperforming baselines.

Title: SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

Authors: Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09439
Pdf URL: https://arxiv.org/pdf/2503.09439
Copy Paste: [[2503.09439]] SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation(https://arxiv.org/abs/2503.09439)
Keywords: diffusion
Abstract: Traditional production workflow of high-precision 3D mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized modelers. The recent years have witnessed remarkable advances in AI-empowered 3D content creation. However, although the latest state-of-the-arts are already capable of generating plausible structures and intricate appearances from images or text prompts, the actual mesh surfaces are typically over-smoothing and lack geometric details. This paper introduces SuperCarver, a 3D geometry super-resolution framework particularly tailored for adding texture-consistent surface details to given coarse meshes. Technically, we start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve geometric detail generation, we develop a deterministic prior-guided normal diffusion model fine-tuned on a carefully curated dataset of paired low-poly and high-poly normal renderings. To optimize mesh structures from potentially imperfect normal map predictions, we design a simple yet effective noise-resistant inverse rendering scheme based on distance field deformation. Extensive experiments show that SuperCarver generates realistic and expressive surface details as depicted by specific texture appearances, making it a powerful tool for automatically upgrading massive outdated low-quality assets and shortening the iteration cycle of high-quality mesh production in practical applications.

Title: Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models

Authors: Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Kui Ren, Ruoxi Jia, Jiaheng Zhang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09446
Pdf URL: https://arxiv.org/pdf/2503.09446
Copy Paste: [[2503.09446]] Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.09446)
Keywords: diffusion
Abstract: Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images but also raise people's concerns about generating harmful or misleading content. While extensive approaches have been proposed to erase unwanted concepts without requiring retraining from scratch, they inadvertently degrade performance on normal generation tasks. In this work, we propose Interpret then Deactivate (ItD), a novel framework to enable precise concept removal in T2I diffusion models while preserving overall performance. ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with target concepts, we repurpose SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models. Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without requiring further training. Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD's effectiveness in eliminating targeted concepts without interfering with normal concept generation. Additionally, ItD is also robust against adversarial prompts designed to circumvent content filters. Code is available at: this https URL.

Title: How Well Does Your Tabular Generator Learn the Structure of Tabular Data?

Authors: Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09453
Pdf URL: https://arxiv.org/pdf/2503.09453
Copy Paste: [[2503.09453]] How Well Does Your Tabular Generator Learn the Structure of Tabular Data?(https://arxiv.org/abs/2503.09453)
Keywords: generative
Abstract: Heterogeneous tabular data poses unique challenges in generative modelling due to its fundamentally different underlying data structure compared to homogeneous modalities, such as images and text. Although previous research has sought to adapt the successes of generative modelling in homogeneous modalities to the tabular domain, defining an effective generator for tabular data remains an open problem. One major reason is that the evaluation criteria inherited from other modalities often fail to adequately assess whether tabular generative models effectively capture or utilise the unique structural information encoded in tabular data. In this paper, we carefully examine the limitations of the prevailing evaluation framework and introduce $\textbf{TabStruct}$, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. Specifically, TabStruct evaluates the alignment of causal structures in real and synthetic data, providing a direct measure of how effectively tabular generative models learn the structure of tabular data. Through extensive experiments using generators from eight categories on seven datasets with expert-validated causal graphical structures, we show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension. Our findings highlight the importance of tabular data structure and offer practical guidance for developing more effective and robust tabular generative models. Code is available at this https URL.

Title: Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness

Authors: Beier Zhu, Jiequan Cui, Hanwang Zhang, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09487
Pdf URL: https://arxiv.org/pdf/2503.09487
Copy Paste: [[2503.09487]] Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness(https://arxiv.org/abs/2503.09487)
Keywords: foundation model
Abstract: While image-text foundation models have succeeded across diverse downstream tasks, they still face challenges in the presence of spurious correlations between the input and label. To address this issue, we propose a simple three-step approach,Project-Probe-Aggregate (PPA), that enables parameter-efficient fine-tuning for foundation models without relying on group annotations. Building upon the failure-based debiasing scheme, our method, PPA, improves its two key components: minority samples identification and the robust training algorithm. Specifically, we first train biased classifiers by projecting image features onto the nullspace of class proxies from text encoders. Next, we infer group labels using the biased classifier and probe group targets with prior correction. Finally, we aggregate group weights of each class to produce the debiased classifier. Our theoretical analysis shows that our PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, mitigating spurious correlations. Extensive experimental results confirm the effectiveness of our PPA: it outperforms the state-of-the-art by an average worst-group accuracy while requiring less than 0.01% tunable parameters without training group labels.

Title: DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction

Authors: Junjie Zhou, Shouju Wang, Yuxia Tang, Qi Zhu, Daoqiang Zhang, Wei Shao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.09491
Pdf URL: https://arxiv.org/pdf/2503.09491
Copy Paste: [[2503.09491]] DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction(https://arxiv.org/abs/2503.09491)
Keywords: diffusion, generative
Abstract: The prediction of nanoparticles (NPs) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of tumor microenvironment (TME) highly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NPs distribution by the aid of multi-modal TME components. However, the distribution divergence among multi-modal TME components may cause side effects i.e., the best uni-modal model may outperform the joint generative model. To address the above issues, we propose a \textbf{D}ivergence-\textbf{A}ware \textbf{M}ulti-\textbf{M}odal \textbf{Diffusion} model (i.e., \textbf{DAMM-Diffusion}) to adaptively generate the prediction results from uni-modal and multi-modal branches in a unified network. In detail, the uni-modal branch is composed of the U-Net architecture while the multi-modal branch extends it by introducing two novel fusion modules i.e., Multi-Modal Fusion Module (MMFM) and Uncertainty-Aware Fusion Module (UAFM). Specifically, the MMFM is proposed to fuse features from multiple modalities, while the UAFM module is introduced to learn the uncertainty map for cross-attention computation. Following the individual prediction results from each branch, the Divergence-Aware Multi-Modal Predictor (DAMMP) module is proposed to assess the consistency of multi-modal data with the uncertainty map, which determines whether the final prediction results come from multi-modal or uni-modal predictions. We predict the NPs distribution given the TME components of tumor vessels and cell nuclei, and the experimental results show that DAMM-Diffusion can generate the distribution of NPs with higher accuracy than the comparing methods. Additional results on the multi-modal brain image synthesis task further validate the effectiveness of the proposed method.

Title: Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection

Authors: Romain Thoreau, Valerio Marsocci, Dawa Derksen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09493
Pdf URL: https://arxiv.org/pdf/2503.09493
Copy Paste: [[2503.09493]] Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection(https://arxiv.org/abs/2503.09493)
Keywords: foundation model
Abstract: As large-scale heterogeneous data sets become increasingly available, adapting foundation models at low cost has become a key issue. Seminal works in natural language processing, e.g. Low-Rank Adaptation (LoRA), leverage the low "intrinsic rank" of parameter updates during adaptation. In this paper, we argue that incorporating stronger inductive biases in both data and models can enhance the adaptation of Geospatial Foundation Models (GFMs), pretrained on RGB satellite images, to other types of optical satellite data. Specifically, the pretrained parameters of GFMs serve as a strong prior for the spatial structure of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environmental-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10$\times$ fewer parameters for classification and segmentation tasks. The code will be made publicly available.

Title: Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder

Authors: Junjie Zhou, Jiao Tang, Yingli Zuo, Peng Wan, Daoqiang Zhang, Wei Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09496
Pdf URL: https://arxiv.org/pdf/2503.09496
Copy Paste: [[2503.09496]] Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder(https://arxiv.org/abs/2503.09496)
Keywords: generative
Abstract: The integrative analysis of histopathological images and genomic data has received increasing attention for survival prediction of human cancers. However, the existing studies always hold the assumption that full modalities are available. As a matter of fact, the cost for collecting genomic data is high, which sometimes makes genomic data unavailable in testing samples. A common way of tackling such incompleteness is to generate the genomic representations from the pathology images. Nevertheless, such strategy still faces the following two challenges: (1) The gigapixel whole slide images (WSIs) are huge and thus hard for representation. (2) It is difficult to generate the genomic embeddings with diverse function categories in a unified generative framework. To address the above challenges, we propose a Conditional Latent Differentiation Variational AutoEncoder (LD-CVAE) for robust multimodal survival prediction, even with missing genomic data. Specifically, a Variational Information Bottleneck Transformer (VIB-Trans) module is proposed to learn compressed pathological representations from the gigapixel WSIs. To generate different functional genomic features, we develop a novel Latent Differentiation Variational AutoEncoder (LD-VAE) to learn the common and specific posteriors for the genomic embeddings with diverse functions. Finally, we use the product-of-experts technique to integrate the genomic common posterior and image posterior for the joint latent distribution estimation in LD-CVAE. We test the effectiveness of our method on five different cancer datasets, and the experimental results demonstrate its superiority in both complete and missing modality scenarios.

Title: MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions

Authors: Zhe Xu, Daoyuan Chen, Zhenqing Ling, Yaliang Li, Ying Shen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.09499
Pdf URL: https://arxiv.org/pdf/2503.09499
Copy Paste: [[2503.09499]] MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions(https://arxiv.org/abs/2503.09499)
Keywords: self-supervised
Abstract: Large vision-language models (VLMs) face challenges in achieving robust, transferable reasoning abilities due to reliance on labor-intensive manual instruction datasets or computationally expensive self-supervised methods. To address these issues, we introduce MindGYM, a framework that enhances VLMs through synthetic self-challenging questions, consisting of three stages: (1) Seed Single-Hop Question Synthesis, generating cognitive questions across textual (e.g., logical deduction) and multimodal contexts (e.g., diagram-based queries) spanning eight semantic areas like ethical analysis; (2) Challenging Multi-Hop Question Synthesis, combining seed questions via diverse principles like bridging, visual-textual alignment, to create multi-step problems demanding deeper reasoning; and (3) Thinking-Induced Curriculum Fine-Tuning, a structured pipeline that progressively trains the model from scaffolded reasoning to standalone inference. By leveraging the model's self-synthesis capability, MindGYM achieves high data efficiency (e.g., +16% gains on MathVision-Mini with only 400 samples), computational efficiency (reducing both training and inference costs), and robust generalization across tasks. Extensive evaluations on seven benchmarks demonstrate superior performance over strong baselines, with notable improvements (+15.77% win rates) in reasoning depth and breadth validated via GPT-based scoring. MindGYM underscores the viability of self-challenging for refining VLM capabilities while minimizing human intervention and resource demands. Code and data are released to advance multimodal reasoning research.

Title: CM-Diff: A Single Generative Network for Bidirectional Cross-Modality Translation Diffusion Model Between Infrared and Visible Images

Authors: Bin Hu, Chenqiang Gao, Shurui Liu, Junjie Guo, Fang Chen, Fangcen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09514
Pdf URL: https://arxiv.org/pdf/2503.09514
Copy Paste: [[2503.09514]] CM-Diff: A Single Generative Network for Bidirectional Cross-Modality Translation Diffusion Model Between Infrared and Visible Images(https://arxiv.org/abs/2503.09514)
Keywords: diffusion, generative
Abstract: The image translation method represents a crucial approach for mitigating information deficiencies in the infrared and visible modalities, while also facilitating the enhancement of modality-specific datasets. However, existing methods for infrared and visible image translation either achieve unidirectional modality translation or rely on cycle consistency for bidirectional modality translation, which may result in suboptimal performance. In this work, we present the cross-modality translation diffusion model (CM-Diff) for simultaneously modeling data distributions in both the infrared and visible modalities. We address this challenge by combining translation direction labels for guidance during training with cross-modality feature control. Specifically, we view the establishment of the mapping relationship between the two modalities as the process of learning data distributions and understanding modality differences, achieved through a novel Bidirectional Diffusion Training (BDT) strategy. Additionally, we propose a Statistical Constraint Inference (SCI) strategy to ensure the generated image closely adheres to the data distribution of the target modality. Experimental results demonstrate the superiority of our CM-Diff over state-of-the-art methods, highlighting its potential for generating dual-modality datasets.

Title: Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging

Authors: Minjae Chung, Jong Bum Won, Ganghyun Kim, Yujin Kim, Utku Ozbulak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09535
Pdf URL: https://arxiv.org/pdf/2503.09535
Copy Paste: [[2503.09535]] Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging(https://arxiv.org/abs/2503.09535)
Keywords: self-supervised
Abstract: Although Vision Transformers (ViTs) have recently demonstrated superior performance in medical imaging problems, they face explainability issues similar to previous architectures such as convolutional neural networks. Recent research efforts suggest that attention maps, which are part of decision-making process of ViTs can potentially address the explainability issue by identifying regions influencing predictions, especially in models pretrained with self-supervised learning. In this work, we compare the visual explanations of attention maps to other commonly used methods for medical imaging problems. To do so, we employ four distinct medical imaging datasets that involve the identification of (1) colonic polyps, (2) breast tumors, (3) esophageal inflammation, and (4) bone fractures and hardware implants. Through large-scale experiments on the aforementioned datasets using various supervised and self-supervised pretrained ViTs, we find that although attention maps show promise under certain conditions and generally surpass GradCAM in explainability, they are outperformed by transformer-specific interpretability methods. Our findings indicate that the efficacy of attention maps as a method of interpretability is context-dependent and may be limited as they do not consistently provide the comprehensive insights required for robust medical decision-making.

Title: GenHPE: Generative Counterfactuals for 3D Human Pose Estimation with Radio Frequency Signals

Authors: Shuokang Huang, Julie A. McCann
Subjects: cs.CV, cs.AI, cs.MM, eess.SP
Abstract URL: https://arxiv.org/abs/2503.09537
Pdf URL: https://arxiv.org/pdf/2503.09537
Copy Paste: [[2503.09537]] GenHPE: Generative Counterfactuals for 3D Human Pose Estimation with Radio Frequency Signals(https://arxiv.org/abs/2503.09537)
Keywords: generative
Abstract: Human pose estimation (HPE) detects the positions of human body joints for various applications. Compared to using cameras, HPE using radio frequency (RF) signals is non-intrusive and more robust to adverse conditions, exploiting the signal variations caused by human interference. However, existing studies focus on single-domain HPE confined by domain-specific confounders, which cannot generalize to new domains and result in diminished HPE performance. Specifically, the signal variations caused by different human body parts are entangled, containing subject-specific confounders. RF signals are also intertwined with environmental noise, involving environment-specific confounders. In this paper, we propose GenHPE, a 3D HPE approach that generates counterfactual RF signals to eliminate domain-specific confounders. GenHPE trains generative models conditioned on human skeleton labels, learning how human body parts and confounders interfere with RF signals. We manipulate skeleton labels (i.e., removing body parts) as counterfactual conditions for generative models to synthesize counterfactual RF signals. The differences between counterfactual signals approximately eliminate domain-specific confounders and regularize an encoder-decoder model to learn domain-independent representations. Such representations help GenHPE generalize to new subjects/environments for cross-domain 3D HPE. We evaluate GenHPE on three public datasets from WiFi, ultra-wideband, and millimeter wave. Experimental results show that GenHPE outperforms state-of-the-art methods and reduces estimation errors by up to 52.2mm for cross-subject HPE and 10.6mm for cross-environment HPE.

Title: TPDiff: Temporal Pyramid Video Diffusion Model

Authors: Lingmin Ran, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09566
Pdf URL: https://arxiv.org/pdf/2503.09566
Copy Paste: [[2503.09566]] TPDiff: Temporal Pyramid Video Diffusion Model(https://arxiv.org/abs/2503.09566)
Keywords: diffusion
Abstract: The development of video diffusion models unveils a significant challenge: the substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in video modality, maintaining full frame rates in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases frame rate along the diffusion process with only the last stage operating on full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODE) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating 50% reduction in training cost and 1.5x improvement in inference efficiency.

Title: Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Authors: Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09573
Pdf URL: https://arxiv.org/pdf/2503.09573
Copy Paste: [[2503.09573]] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models(https://arxiv.org/abs/2503.09573)
Keywords: diffusion
Abstract: Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: this https URL

Title: Minimax Optimality of the Probability Flow ODE for Diffusion Models

Authors: Changxiao Cai, Gen Li
Subjects: cs.LG, cs.IT, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09583
Pdf URL: https://arxiv.org/pdf/2503.09583
Copy Paste: [[2503.09583]] Minimax Optimality of the Probability Flow ODE for Diffusion Models(https://arxiv.org/abs/2503.09583)
Keywords: diffusion, generative
Abstract: Score-based diffusion models have become a foundational paradigm for modern generative modeling, demonstrating exceptional capability in generating samples from complex high-dimensional distributions. Despite the dominant adoption of probability flow ODE-based samplers in practice due to their superior sampling efficiency and precision, rigorous statistical guarantees for these methods have remained elusive in the literature. This work develops the first end-to-end theoretical framework for deterministic ODE-based samplers that establishes near-minimax optimal guarantees under mild assumptions on target data distributions. Specifically, focusing on subgaussian distributions with $\beta$-Hölder smooth densities for $\beta\leq 2$, we propose a smooth regularized score estimator that simultaneously controls both the $L^2$ score error and the associated mean Jacobian error. Leveraging this estimator within a refined convergence analysis of the ODE-based sampling process, we demonstrate that the resulting sampler achieves the minimax rate in total variation distance, modulo logarithmic factors. Notably, our theory comprehensively accounts for all sources of error in the sampling process and does not require strong structural conditions such as density lower bounds or Lipschitz/smooth scores on target distributions, thereby covering a broad range of practical data distributions.

Title: PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

Authors: Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, Saining Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09595
Pdf URL: https://arxiv.org/pdf/2503.09595
Copy Paste: [[2503.09595]] PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop(https://arxiv.org/abs/2503.09595)
Keywords: diffusion, generative
Abstract: Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.

Title: RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling

Authors: Itay Chachy, Guy Yariv, Sagie Benaim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09601
Pdf URL: https://arxiv.org/pdf/2503.09601
Copy Paste: [[2503.09601]] RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling(https://arxiv.org/abs/2503.09601)
Keywords: diffusion
Abstract: Score Distillation Sampling (SDS) has emerged as an effective technique for leveraging 2D diffusion priors for tasks such as text-to-3D generation. While powerful, SDS struggles with achieving fine-grained alignment to user intent. To overcome this, we introduce RewardSDS, a novel approach that weights noise samples based on alignment scores from a reward model, producing a weighted SDS loss. This loss prioritizes gradients from noise samples that yield aligned high-reward output. Our approach is broadly applicable and can extend SDS-based methods. In particular, we demonstrate its applicability to Variational Score Distillation (VSD) by introducing RewardVSD. We evaluate RewardSDS and RewardVSD on text-to-image, 2D editing, and text-to-3D generation tasks, showing significant improvements over SDS and VSD on a diverse set of metrics measuring generation quality and alignment to desired reward models, enabling state-of-the-art performance. Project page is available at https://itaychachy. this http URL.