2025-02-28

Title: On the Interpolation Effect of Score Smoothing

Authors: Zhengdao Chen
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2502.19499
Pdf URL: https://arxiv.org/pdf/2502.19499
Copy Paste: [[2502.19499]] On the Interpolation Effect of Score Smoothing(https://arxiv.org/abs/2502.19499)
Keywords: diffusion
Abstract: Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we examine the hypothesis that their generalization ability arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional linear subspace, we study the interplay between score smoothing and the denoising dynamics with mathematically solvable models. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate among the training data within their subspace while avoiding full memorization. We also present evidence that learning score functions with regularized neural networks can have a similar effect on the denoising dynamics as score smoothing.

Title: TRIX: A More Expressive Model for Zero-shot Domain Transfer in Knowledge Graphs

Authors: Yucheng Zhang, Beatrice Bevilacqua, Mikhail Galkin, Bruno Ribeiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.19512
Pdf URL: https://arxiv.org/pdf/2502.19512
Copy Paste: [[2502.19512]] TRIX: A More Expressive Model for Zero-shot Domain Transfer in Knowledge Graphs(https://arxiv.org/abs/2502.19512)
Keywords: foundation model
Abstract: Fully inductive knowledge graph models can be trained on multiple domains and subsequently perform zero-shot knowledge graph completion (KGC) in new unseen domains. This is an important capability towards the goal of having foundation models for knowledge graphs. In this work, we introduce a more expressive and capable fully inductive model, dubbed TRIX, which not only yields strictly more expressive triplet embeddings (head entity, relation, tail entity) compared to state-of-the-art methods, but also introduces a new capability: directly handling both entity and relation prediction tasks in inductive settings. Empirically, we show that TRIX outperforms the state-of-the-art fully inductive models in zero-shot entity and relation predictions in new domains, and outperforms large-context LLMs in out-of-domain predictions. The source code is available at this https URL.

Title: Mixtraining: A Better Trade-Off Between Compute and Performance

Authors: Zexin Li, Jiancheng Zhang, Yinglun Zhu, Cong Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19513
Pdf URL: https://arxiv.org/pdf/2502.19513
Copy Paste: [[2502.19513]] Mixtraining: A Better Trade-Off Between Compute and Performance(https://arxiv.org/abs/2502.19513)
Keywords: self-supervised
Abstract: Incorporating self-supervised learning (SSL) before standard supervised learning (SL) has become a widely used strategy to enhance model performance, particularly in data-limited scenarios. However, this approach introduces a trade-off between computation and performance: while SSL helps with representation learning, it requires a separate, often time-consuming training phase, increasing computational overhead and limiting efficiency in resource-constrained settings. To address these challenges, we propose MixTraining, a novel framework that interleaves several SSL and SL epochs within a unified mixtraining training phase, featuring a smooth transition between two learning objectives. MixTraining enhances synergy between SSL and SL for improved accuracy and consolidates shared computation steps to reduce computation overhead. MixTraining is versatile and applicable to both single-task and multi-task learning scenarios. Extensive experiments demonstrate that MixTraining offers a superior compute-performance trade-off compared to conventional pipelines, achieving an 8.81% absolute accuracy gain (18.89% relative accuracy gain) on the TinyImageNet dataset while accelerating training by up to 1.29x with the ViT-Tiny model.

Title: Retrieval Augmented Anomaly Detection (RAAD): Nimble Model Adjustment Without Retraining

Authors: Sam Pastoriza, Iman Yousfi, Christopher Redino, Marc Vucovich, Abdul Rahman, Sal Aguinaga, Dhruv Nandakumar
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2502.19534
Pdf URL: https://arxiv.org/pdf/2502.19534
Copy Paste: [[2502.19534]] Retrieval Augmented Anomaly Detection (RAAD): Nimble Model Adjustment Without Retraining(https://arxiv.org/abs/2502.19534)
Keywords: anomaly
Abstract: We propose a novel mechanism for real-time (human-in-the-loop) feedback focused on false positive reduction to enhance anomaly detection models. It was designed for the lightweight deployment of a behavioral network anomaly detection model. This methodology is easily integrable to similar domains that require a premium on throughput while maintaining high precision. In this paper, we introduce Retrieval Augmented Anomaly Detection, a novel method taking inspiration from Retrieval Augmented Generation. Human annotated examples are sent to a vector store, which can modify model outputs on the very next processed batch for model inference. To demonstrate the generalization of this technique, we benchmarked several different model architectures and multiple data modalities, including images, text, and graph-based data.

Title: Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?

Authors: Yudi Zhang, Lu Wang, Meng Fang, Yali Du, Chenghua Huang, Jun Wang, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19557
Pdf URL: https://arxiv.org/pdf/2502.19557
Copy Paste: [[2502.19557]] Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?(https://arxiv.org/abs/2502.19557)
Keywords: self-supervised
Abstract: Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). However, this approach neglects the potential to distill both data (output content) and reward signals (quality evaluations). Extracting reliable reward signals directly from teacher models is challenging, as LLMs are optimized for generation rather than evaluation, often resulting in biased or inconsistent assessments. To address this limitation, we propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses, enabling reward learning without explicit external evaluation. The reward model subsequently guides reinforcement learning (RL), allowing iterative refinement of the student model after an SFT warm-up phase. Experiments on GSM8K and MMLU-PRO demonstrate that our method consistently outperforms traditional SFT-based approaches, enabling student models to surpass the performance of their teachers. This work highlights the potential for scalable, efficient distillation through structured self-supervised reward learning, reducing dependence on external reward supervision.

Title: Stay Focused: Problem Drift in Multi-Agent Debate

Authors: Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas, Bela Gipp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19559
Pdf URL: https://arxiv.org/pdf/2502.19559
Copy Paste: [[2502.19559]] Stay Focused: Problem Drift in Multi-Agent Debate(https://arxiv.org/abs/2502.19559)
Keywords: generative
Abstract: Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations, particularly when scaling them to longer reasoning chains. In this study, we unveil a new issue of multi-agent debate: discussions drift away from the initial problem over multiple turns. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, we perform a human study with eight experts on discussions suffering from problem drift, who find the most common issues are a lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To systematically address the issue of problem drift, we propose DRIFTJudge, a method based on LLM-as-a-judge, to detect problem drift at test-time. We further propose DRIFTPolicy, a method to mitigate 31% of problem drift cases. Our study can be seen as a first step to understanding a key limitation of multi-agent debate, highlighting pathways for improving their effectiveness in the future.

Title: Tell me why: Visual foundation models as self-explainable classifiers

Authors: Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19577
Pdf URL: https://arxiv.org/pdf/2502.19577
Copy Paste: [[2502.19577]] Tell me why: Visual foundation models as self-explainable classifiers(https://arxiv.org/abs/2502.19577)
Keywords: foundation model
Abstract: Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at this https URL.

Title: NeoBERT: A Next-Generation BERT

Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19587
Pdf URL: https://arxiv.org/pdf/2502.19587
Copy Paste: [[2502.19587]] NeoBERT: A Next-Generation BERT(https://arxiv.org/abs/2502.19587)
Keywords: in-context
Abstract: Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.

Title: Improving Representation Learning of Complex Critical Care Data with ICU-BERT

Authors: Ricardo Santos, André V. Carreiro, Xi Peng, Hugo Gamboa, Holger Fröhlich
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19593
Pdf URL: https://arxiv.org/pdf/2502.19593
Copy Paste: [[2502.19593]] Improving Representation Learning of Complex Critical Care Data with ICU-BERT(https://arxiv.org/abs/2502.19593)
Keywords: generative
Abstract: The multivariate, asynchronous nature of real-world clinical data, such as that generated in Intensive Care Units (ICUs), challenges traditional AI-based decision-support systems. These often assume data regularity and feature independence and frequently rely on limited data scopes and manual feature engineering. The potential of generative AI technologies has not yet been fully exploited to analyze clinical data. We introduce ICU-BERT, a transformer-based model pre-trained on the MIMIC-IV database using a multi-task scheme to learn robust representations of complex ICU data with minimal preprocessing. ICU-BERT employs a multi-token input strategy, incorporating dense embeddings from a biomedical Large Language Model to learn a generalizable representation of complex and multivariate ICU data. With an initial evaluation of five tasks and four additional ICU datasets, ICU-BERT results indicate that ICU-BERT either compares to or surpasses current performance benchmarks by leveraging fine-tuning. By integrating structured and unstructured data, ICU-BERT advances the use of foundational models in medical informatics, offering an adaptable solution for clinical decision support across diverse applications.

Title: Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review

Authors: Sungduk Yu, Man Luo, Avinash Madusu, Vasudev Lal, Phillip Howard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19614
Pdf URL: https://arxiv.org/pdf/2502.19614
Copy Paste: [[2502.19614]] Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review(https://arxiv.org/abs/2502.19614)
Keywords: generative
Abstract: Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs. Motivated by the shortcomings of existing methods, we propose a new detection approach which surpasses existing methods in the identification of AI written peer reviews. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI.

Title: 3D Nephrographic Image Synthesis in CT Urography with the Diffusion Model and Swin Transformer

Authors: Hongkun Yu, Syed Jamal Safdar Gardezi, E. Jason Abel, Daniel Shapiro, Meghan G. Lubner, Joshua Warner, Matthew Smith, Giuseppe Toia, Lu Mao, Pallavi Tiwari, Andrew L. Wentland
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19623
Pdf URL: https://arxiv.org/pdf/2502.19623
Copy Paste: [[2502.19623]] 3D Nephrographic Image Synthesis in CT Urography with the Diffusion Model and Swin Transformer(https://arxiv.org/abs/2502.19623)
Keywords: diffusion
Abstract: Purpose: This study aims to develop and validate a method for synthesizing 3D nephrographic phase images in CT urography (CTU) examinations using a diffusion model integrated with a Swin Transformer-based deep learning approach. Materials and Methods: This retrospective study was approved by the local Institutional Review Board. A dataset comprising 327 patients who underwent three-phase CTU (mean $\pm$ SD age, 63 $\pm$ 15 years; 174 males, 153 females) was curated for deep learning model development. The three phases for each patient were aligned with an affine registration algorithm. A custom deep learning model coined dsSNICT (diffusion model with a Swin transformer for synthetic nephrographic phase images in CT) was developed and implemented to synthesize the nephrographic images. Performance was assessed using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Mean Absolute Error (MAE), and Fréchet Video Distance (FVD). Qualitative evaluation by two fellowship-trained abdominal radiologists was performed. Results: The synthetic nephrographic images generated by our proposed approach achieved high PSNR (26.3 $\pm$ 4.4 dB), SSIM (0.84 $\pm$ 0.069), MAE (12.74 $\pm$ 5.22 HU), and FVD (1323). Two radiologists provided average scores of 3.5 for real images and 3.4 for synthetic images (P-value = 0.5) on a Likert scale of 1-5, indicating that our synthetic images closely resemble real images. Conclusion: The proposed approach effectively synthesizes high-quality 3D nephrographic phase images. This model can be used to reduce radiation dose in CTU by 33.3\% without compromising image quality, which thereby enhances the safety and diagnostic utility of CT urography.

Title: cMIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning

Authors: Micha Livne
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.19642
Pdf URL: https://arxiv.org/pdf/2502.19642
Copy Paste: [[2502.19642]] cMIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning(https://arxiv.org/abs/2502.19642)
Keywords: self-supervised, generative
Abstract: Learning representations that are useful for unknown downstream tasks is a fundamental challenge in representation learning. Prominent approaches in this domain include contrastive learning, self-supervised masking, and denoising auto-encoders. In this paper, we introduce a novel method, termed contrastive Mutual Information Machine (cMIM), which aims to enhance the utility of learned representations for downstream tasks. cMIM integrates a new contrastive learning loss with the Mutual Information Machine (MIM) learning framework, a probabilistic auto-encoder that maximizes the mutual information between inputs and latent representations while clustering the latent codes. Despite MIM's potential, initial experiments indicated that the representations learned by MIM were less effective for discriminative downstream tasks compared to state-of-the-art (SOTA) models. The proposed cMIM method directly addresses this limitation. The main contributions of this work are twofold: (1) We propose a novel contrastive extension to MIM for learning discriminative representations which eliminates the need for data augmentation and is robust to variations in the number of negative examples (i.e., batch size). (2) We introduce a generic method for extracting informative embeddings from encoder-decoder models, which significantly improves performance in discriminative downstream tasks without requiring additional training. This method is applicable to any pre-trained encoder-decoder model. By presenting cMIM, we aim to offer a unified generative model that is effective for both generative and discriminative tasks. Our results demonstrate that the learned representations are valuable for downstream tasks while maintaining the generative capabilities of MIM.

Title: SubZero: Composing Subject, Style, and Action via Zero-Shot Personalization

Authors: Shubhankar Borse, Kartikeya Bhardwaj, Mohammad Reza Karimi Dastjerdi, Hyojin Park, Shreya Kadambi, Shobitha Shivakumar, Prathamesh Mandke, Ankita Nayak, Harris Teague, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19673
Pdf URL: https://arxiv.org/pdf/2502.19673
Copy Paste: [[2502.19673]] SubZero: Composing Subject, Style, and Action via Zero-Shot Personalization(https://arxiv.org/abs/2502.19673)
Keywords: diffusion, generative
Abstract: Diffusion models are increasingly popular for generative tasks, including personalized composition of subjects and styles. While diffusion models can generate user-specified subjects performing text-guided actions in custom styles, they require fine-tuning and are not feasible for personalization on mobile devices. Hence, tuning-free personalization methods such as IP-Adapters have progressively gained traction. However, for the composition of subjects and styles, these works are less flexible due to their reliance on ControlNet, or show content and style leakage artifacts. To tackle these, we present SubZero, a novel framework to generate any subject in any style, performing any action without the need for fine-tuning. We propose a novel set of constraints to enhance subject and style similarity, while reducing leakage. Additionally, we propose an orthogonalized temporal aggregation scheme in the cross-attention blocks of denoising model, effectively conditioning on a text prompt along with single subject and style images. We also propose a novel method to train customized content and style projectors to reduce content and style leakage. Through extensive experiments, we show that our proposed approach, while suitable for running on-edge, shows significant improvements over state-of-the-art works performing subject, style and action composition.

Title: BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

Authors: Xin Ye, Burhaneddin Yaman, Sheng Cheng, Feng Tao, Abhirup Mallik, Liu Ren
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19694
Pdf URL: https://arxiv.org/pdf/2502.19694
Copy Paste: [[2502.19694]] BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance(https://arxiv.org/abs/2502.19694)
Keywords: diffusion
Abstract: Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3\% in mAP and 10.1\% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.

Title: You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving

Authors: Guangfeng Jiang, Jun Liu, Yongxuan Lv, Yuzhi Wu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19698
Pdf URL: https://arxiv.org/pdf/2502.19698
Copy Paste: [[2502.19698]] You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving(https://arxiv.org/abs/2502.19698)
Keywords: foundation model
Abstract: Outdoor LiDAR point cloud 3D instance segmentation is a crucial task in autonomous driving. However, it requires laborious human efforts to annotate the point cloud for training a segmentation model. To address this challenge, we propose a YoCo framework, which generates 3D pseudo labels using minimal coarse click annotations in the bird's eye view plane. It is a significant challenge to produce high-quality pseudo labels from sparse annotations. Our YoCo framework first leverages vision foundation models combined with geometric constraints from point clouds to enhance pseudo label generation. Second, a temporal and spatial-based label updating module is designed to generate reliable updated labels. It leverages predictions from adjacent frames and utilizes the inherent density variation of point clouds (dense near, sparse far). Finally, to further improve label quality, an IoU-guided enhancement module is proposed, replacing pseudo labels with high-confidence and high-IoU predictions. Experiments on the Waymo dataset demonstrate YoCo's effectiveness and generality, achieving state-of-the-art performance among weakly supervised methods and surpassing fully supervised Cylinder3D. Additionally, the YoCo is suitable for various networks, achieving performance comparable to fully supervised methods with minimal fine-tuning using only 0.8% of the fully labeled data, significantly reducing annotation costs.

Title: Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification

Authors: Yimin Zhu, Linlin Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19699
Pdf URL: https://arxiv.org/pdf/2502.19699
Copy Paste: [[2502.19699]] Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification(https://arxiv.org/abs/2502.19699)
Keywords: diffusion
Abstract: Although efficient extraction of discriminative spatial-spectral features is critical for hyperspectral images classification (HSIC), it is difficult to achieve these features due to factors such as the spatial-spectral heterogeneity and noise effect. This paper presents a Spatial-Spectral Diffusion Contrastive Representation Network (DiffCRN), based on denoising diffusion probabilistic model (DDPM) combined with contrastive learning (CL) for HSIC, with the following characteristics. First,to improve spatial-spectral feature representation, instead of adopting the UNets-like structure which is widely used for DDPM, we design a novel staged architecture with spatial self-attention denoising module (SSAD) and spectral group self-attention denoising module (SGSAD) in DiffCRN with improved efficiency for spectral-spatial feature learning. Second, to improve unsupervised feature learning efficiency, we design new DDPM model with logarithmic absolute error (LAE) loss and CL that improve the loss function effectiveness and increase the instance-level and inter-class discriminability. Third, to improve feature selection, we design a learnable approach based on pixel-level spectral angle mapping (SAM) for the selection of time steps in the proposed DDPM model in an adaptive and automatic manner. Last, to improve feature integration and classification, we design an Adaptive weighted addition modul (AWAM) and Cross time step Spectral-Spatial Fusion Module (CTSSFM) to fuse time-step-wise features and perform classification. Experiments conducted on widely used four HSI datasets demonstrate the improved performance of the proposed DiffCRN over the classical backbone models and state-of-the-art GAN, transformer models and other pretrained methods. The source code and pre-trained model will be made available publicly.

Title: Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model

Authors: Yimin Zhu, Linlin Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19700
Pdf URL: https://arxiv.org/pdf/2502.19700
Copy Paste: [[2502.19700]] Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model(https://arxiv.org/abs/2502.19700)
Keywords: diffusion
Abstract: Although data augmentation is an effective method to address the imbalanced-small sample data (ISSD) problem in hyperspectral image classification (HSIC), most methodologies extend features in the latent space. Few, however, generate realistic and diverse samples using text information to balance the limited number of annotated samples. Recently, text-driven diffusion models have gained significant attention due to their remarkable ability to generate highly diverse images based on given text prompts in natural image synthesis. Therefore, this paper proposes a novel language-informed hyperspectral image synthesis method (Txt2HSI-LDM(VAE)) for addressing the ISSD problem of HSIC. First, for addressing the high-dimensional hyperspectral data, we use universal varitional autoencoeder (VAE) to map the hyperspectral into a low-dimensional latent space and get stable feature representation, which hugely reduce the inference parameter of diffusion model. Next, a semi-supervised diffusion model is designed for fully taking advantage of unlabeled data, beside, random polygon spatial clipping (RPSC) and uncertainty estimation of latent feature (LF-UE) are also used for simulating the varying degrees of mixing of training data. Then, VAE decodes HSI from latent space generated by diffusion model with the conditional language as input, contributing to more realistic and diverse samples. In our experiments, we fully evaluate the effectiveness of synthetic samples from aspect of statistical characteristic and data distribution in 2D-PCA space. Additionally, cross-attention map is visualized on the pixel-level to prove that our proposed model can capture the spatial layout of and geometry of the generated hyperspectral image depend on the visual-linguistic alignment.

Title: SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models

Authors: Mingsi Wang, Shuaiyin Yao, Chang Yue, Lijie Zhang, Guozhu Meng
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2502.19710
Pdf URL: https://arxiv.org/pdf/2502.19710
Copy Paste: [[2502.19710]] SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models(https://arxiv.org/abs/2502.19710)
Keywords: diffusion
Abstract: Given the need to evaluate the robustness of face recognition (FR) models, many efforts have focused on adversarial patch attacks that mislead FR models by introducing localized perturbations. Impersonation attacks are a significant threat because adversarial perturbations allow attackers to disguise themselves as legitimate users. This can lead to severe consequences, including data breaches, system damage, and misuse of resources. However, research on such attacks in FR remains limited. Existing adversarial patch generation methods exhibit limited efficacy in impersonation attacks due to (1) the need for high attacker capabilities, (2) low attack success rates, and (3) excessive query requirements. To address these challenges, we propose a novel method SAP-DIFF that leverages diffusion models to generate adversarial patches via semantic perturbations in the latent space rather than direct pixel manipulation. We introduce an attention disruption mechanism to generate features unrelated to the original face, facilitating the creation of adversarial samples and a directional loss function to guide perturbations toward the target identity feature space, thereby enhancing attack effectiveness and efficiency. Extensive experiments on popular FR models and datasets demonstrate that our method outperforms state-of-the-art approaches, achieving an average attack success rate improvement of 45.66% (all exceeding 40%), and a reduction in the number of queries by about 40% compared to the SOTA approach

Title: Recent Advances on Generalizable Diffusion-generated Image Detection

Authors: Qijie Xu, Defang Chen, Jiawei Chen, Siwei Lyu, Can Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19716
Pdf URL: https://arxiv.org/pdf/2502.19716
Copy Paste: [[2502.19716]] Recent Advances on Generalizable Diffusion-generated Image Detection(https://arxiv.org/abs/2502.19716)
Keywords: diffusion
Abstract: The rise of diffusion models has significantly improved the fidelity and diversity of generated images. With numerous benefits, these advancements also introduce new risks. Diffusion models can be exploited to create high-quality Deepfake images, which poses challenges for image authenticity verification. In recent years, research on generalizable diffusion-generated image detection has grown rapidly. However, a comprehensive review of this topic is still lacking. To bridge this gap, we present a systematic survey of recent advances and classify them into two main categories: (1) data-driven detection and (2) feature-driven detection. Existing detection methods are further classified into six fine-grained categories based on their underlying principles. Finally, we identify several open challenges and envision some future directions, with the hope of inspiring more research work on this important topic. Reviewed works in this survey can be found at this https URL.

Title: Learning Mask Invariant Mutual Information for Masked Image Modeling

Authors: Tao Huang, Yanxiang Ma, Shan You, Chang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19718
Pdf URL: https://arxiv.org/pdf/2502.19718
Copy Paste: [[2502.19718]] Learning Mask Invariant Mutual Information for Masked Image Modeling(https://arxiv.org/abs/2502.19718)
Keywords: self-supervised
Abstract: Masked autoencoders (MAEs) represent a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI-MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.

Title: Few-Shot Multilingual Open-Domain QA from 5 Examples

Authors: Fan Jiang, Tom Drummond, Trevor Cohn
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.19722
Pdf URL: https://arxiv.org/pdf/2502.19722
Copy Paste: [[2502.19722]] Few-Shot Multilingual Open-Domain QA from 5 Examples(https://arxiv.org/abs/2502.19722)
Keywords: self-supervised
Abstract: Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a \emph{few-shot learning} approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, \textsc{FsModQA}, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a \emph{cross-lingual prompting} strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.

Title: Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network

Authors: Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19754
Pdf URL: https://arxiv.org/pdf/2502.19754
Copy Paste: [[2502.19754]] Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network(https://arxiv.org/abs/2502.19754)
Keywords: diffusion, self-supervised
Abstract: In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance the efficiency and quality compared to the diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as next step on the path using complex networks through self-supervised training, which typically results in a gap with the global optimum. Meanwhile, most diffusion models are in the same path subspace generated by weights $f_A(t)$ and $f_B(t)$, as they follow the paradigm ($x_t = f_A(t)x_{Img} + f_B(t)\epsilon$). To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. The experiment shows that our LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network and the KAN for optimising is only less than 0.1MB. The FID metric is reduced by \textbf{more than 15\%}, especially with a reduction of 48.50\% when NFE of DDIM is $5$ for the CelebA dataset. Code is available at this https URL.

Title: EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models

Authors: Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19765
Pdf URL: https://arxiv.org/pdf/2502.19765
Copy Paste: [[2502.19765]] EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models(https://arxiv.org/abs/2502.19765)
Keywords: diffusion
Abstract: We propose EdiText, a controllable text editing method that modify the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at broad range of levels across various tasks, including toxicity control and sentiment control.

Title: In-Context Learning with Hypothesis-Class Guidance

Authors: Ziqian Lin, Shubham Kumar Bharti, Kangwook Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.19787
Pdf URL: https://arxiv.org/pdf/2502.19787
Copy Paste: [[2502.19787]] In-Context Learning with Hypothesis-Class Guidance(https://arxiv.org/abs/2502.19787)
Keywords: in-context
Abstract: Recent research has investigated the underlying mechanisms of in-context learning (ICL) both theoretically and empirically, often using data generated from simple function classes. However, the existing work often focuses on the sequence consisting solely of labeled examples, while in practice, labeled examples are typically accompanied by an instruction, providing some side information about the task. In this work, we propose ICL with hypothesis-class guidance (ICL-HCG), a novel synthetic data model for ICL where the input context consists of the literal description of a (finite) hypothesis class $\mathcal{H}$ and $(x,y)$ pairs from a hypothesis chosen from $\mathcal{H}$. Under our framework ICL-HCG, we conduct extensive experiments to explore: (i) a variety of generalization abilities to new hypothesis classes; (ii) different model architectures; (iii) sample complexity; (iv) in-context data imbalance; (v) the role of instruction; and (vi) the effect of pretraining hypothesis diversity. As a result, we show that (a) Transformers can successfully learn ICL-HCG and generalize to unseen hypotheses and unseen hypothesis classes, and (b) compared with ICL without instruction, ICL-HCG achieves significantly higher accuracy, demonstrating the role of instructions.

Title: Mixtera: A Data Plane for Foundation Model Training

Authors: Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Ana Klimovic
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2502.19790
Pdf URL: https://arxiv.org/pdf/2502.19790
Copy Paste: [[2502.19790]] Mixtera: A Data Plane for Foundation Model Training(https://arxiv.org/abs/2502.19790)
Keywords: foundation model
Abstract: State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used in which proportion and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be declaratively queried. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well as dynamic adjustment of the mixture based on model feedback. We experimentally evaluate Mixtera and show that our implementation does not bottleneck training and scales to 256 GH200 superchips. We demonstrate how Mixtera supports recent advancements in mixing strategies by implementing the proposed Adaptive Data Optimization (ADO) algorithm in the system and evaluating its performance impact. We also explore the role of mixtures for vision-language models.

Title: MFSR: Multi-fractal Feature for Super-resolution Reconstruction with Fine Details Recovery

Authors: Lianping Yang, Peng Jiao, Jinshan Pan, Hegui Zhu, Su Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19797
Pdf URL: https://arxiv.org/pdf/2502.19797
Copy Paste: [[2502.19797]] MFSR: Multi-fractal Feature for Super-resolution Reconstruction with Fine Details Recovery(https://arxiv.org/abs/2502.19797)
Keywords: diffusion
Abstract: In the process of performing image super-resolution processing, the processing of complex localized information can have a significant impact on the quality of the image generated. Fractal features can capture the rich details of both micro and macro texture structures in an image. Therefore, we propose a diffusion model-based super-resolution method incorporating fractal features of low-resolution images, named MFSR. MFSR leverages these fractal features as reinforcement conditions in the denoising process of the diffusion model to ensure accurate recovery of texture information. MFSR employs convolution as a soft assignment to approximate the fractal features of low-resolution images. This approach is also used to approximate the density feature maps of these images. By using soft assignment, the spatial layout of the image is described hierarchically, encoding the self-similarity properties of the image at different scales. Different processing methods are applied to various types of features to enrich the information acquired by the model. In addition, a sub-denoiser is integrated in the denoising U-Net to reduce the noise in the feature maps during the up-sampling process in order to improve the quality of the generated images. Experiments conducted on various face and natural image datasets demonstrate that MFSR can generate higher quality images.

Title: UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition

Authors: Xiao Lin, Yuge Huang, Jianqing Xu, Yuxi Mi, Shuigeng Zhou, Shouhong Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19803
Pdf URL: https://arxiv.org/pdf/2502.19803
Copy Paste: [[2502.19803]] UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition(https://arxiv.org/abs/2502.19803)
Keywords: diffusion
Abstract: Face recognition (FR) stands as one of the most crucial applications in computer vision. The accuracy of FR models has significantly improved in recent years due to the availability of large-scale human face datasets. However, directly using these datasets can inevitably lead to privacy and legal problems. Generating synthetic data to train FR models is a feasible solution to circumvent these issues. While existing synthetic-based face recognition methods have made significant progress in generating identity-preserving images, they are severely plagued by context overfitting, resulting in a lack of intra-class diversity of generated images and poor face recognition performance. In this paper, we propose a framework to Unleash Inherent capability of the model to enhance intra-class diversity for synthetic face recognition, shortened as UIFace. Our framework first trains a diffusion model that can perform sampling conditioned on either identity contexts or a learnable empty context. The former generates identity-preserving images but lacks variations, while the latter exploits the model's intrinsic ability to synthesize intra-class-diversified images but with random identities. Then we adopt a novel two-stage sampling strategy during inference to fully leverage the strengths of both types of contexts, resulting in images that are diverse as well as identitypreserving. Moreover, an attention injection module is introduced to further augment the intra-class variations by utilizing attention maps from the empty context to guide the sampling process in ID-conditioned generation. Experiments show that our method significantly surpasses previous approaches with even less training data and half the size of synthetic dataset. The proposed UIFace even achieves comparable performance with FR models trained on real datasets when we further increase the number of synthetic identities.

Title: Implicit Search via Discrete Diffusion: A Study on Chess

Authors: Jiacheng Ye, Zhenyu Wu, Jiahui Gao, Zhiyong Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19805
Pdf URL: https://arxiv.org/pdf/2502.19805
Copy Paste: [[2502.19805]] Implicit Search via Discrete Diffusion: A Study on Chess(https://arxiv.org/abs/2502.19805)
Keywords: diffusion
Abstract: In the post-AlphaGo era, there has been a renewed interest in search techniques such as Monte Carlo Tree Search (MCTS), particularly in their application to Large Language Models (LLMs). This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch , a model that does \textit{implicit search} by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search-based policies, along with a significant 540 Elo increase in game-playing strength assessment. These results indicate that implicit search via discrete diffusion is a viable alternative to explicit search over a one-step policy. All codes are publicly available at \href{this https URL}{this https URL}.

Title: Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study

Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19828
Pdf URL: https://arxiv.org/pdf/2502.19828
Copy Paste: [[2502.19828]] Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study(https://arxiv.org/abs/2502.19828)
Keywords: diffusion
Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing that biases in the CLIP text encoder significantly impact text-to-image generation tasks. Our experiments demonstrate how these biases affect CLIP's performance in image-caption matching and generation tasks, particularly when manipulating object sizes and their order in captions. This work contributes valuable insights into CLIP's behavior in complex visual environments and highlights areas for improvement in future vision-language models.

Title: CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19842
Pdf URL: https://arxiv.org/pdf/2502.19842
Copy Paste: [[2502.19842]] CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation(https://arxiv.org/abs/2502.19842)
Keywords: diffusion
Abstract: Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: this https URL.

Title: One-for-More: Continual Diffusion Model for Anomaly Detection

Authors: Xiaofan Li, Xin Tan, Zhuo Chen, Zhizhong Zhang, Ruixin Zhang, Rizen Guo, Guanna Jiang, Yulong Chen, Yanyun Qu, Lizhuang Ma, Yuan Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19848
Pdf URL: https://arxiv.org/pdf/2502.19848
Copy Paste: [[2502.19848]] One-for-More: Continual Diffusion Model for Anomaly Detection(https://arxiv.org/abs/2502.19848)
Keywords: diffusion, generative, anomaly
Abstract: With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe ``faithfulness hallucination'' and ``catastrophic forgetting'', which can't meet the unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection deploys a regularization on the model updating by modifying the gradient towards the direction protecting the learned knowledge. But as a double-edged sword, it also requires huge memory costs brought by the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the risk of ``over-fitting'' to normal images of the diffusion model, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, ours achieves first place in 17/18 settings on MVTec and VisA. Code is available at this https URL

Title: MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

Authors: Yujia Chen, Changsong Li, Yiming Wang, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19860
Pdf URL: https://arxiv.org/pdf/2502.19860
Copy Paste: [[2502.19860]] MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue(https://arxiv.org/abs/2502.19860)
Keywords: generative
Abstract: Mental health issues are worsening in today's competitive society, such as depression and anxiety. Traditional healings like counseling and chatbots fail to engage effectively, they often provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LLMs to be equipped with human-like adaptability and warmth. To fill this gap, we propose the MIND (Multi-agent INner Dialogue), a novel paradigm that provides more immersive psychological healing environments. Considering the strong generative and role-playing ability of LLM agents, we predefine an interactive healing framework and assign LLM agents different roles within the framework to engage in interactive inner dialogues with users, thereby providing an immersive healing experience. We conduct extensive human experiments in various real-world healing dimensions, and find that MIND provides a more user-friendly experience than traditional paradigms. This demonstrates that MIND effectively leverages the significant potential of LLMs in psychological healing.

Title: C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation

Authors: Yuhao Li, Mirana Claire Angel, Salman Khan, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19868
Pdf URL: https://arxiv.org/pdf/2502.19868
Copy Paste: [[2502.19868]] C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation(https://arxiv.org/abs/2502.19868)
Keywords: diffusion
Abstract: Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, the existing trajectory-based approaches are usually limited to only generating the motion trajectory of the controlled object and ignoring the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects that can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at this https URL.

Title: High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model

Authors: Mingtao Guo, Guanyu Xing, Yanli Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19894
Pdf URL: https://arxiv.org/pdf/2502.19894
Copy Paste: [[2502.19894]] High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model(https://arxiv.org/abs/2502.19894)
Keywords: diffusion
Abstract: Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods fail to achieve relightable portraits because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we employ the 3D mesh, pose, and lighting-rendered shading hints of the portrait to represent the extrinsic attributes, while the reference represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.

Title: GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors

Authors: An Li, Zhe Zhu, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19896
Pdf URL: https://arxiv.org/pdf/2502.19896
Copy Paste: [[2502.19896]] GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors(https://arxiv.org/abs/2502.19896)
Keywords: generative
Abstract: Existing point cloud completion methods, which typically depend on predefined synthetic training datasets, encounter significant challenges when applied to out-of-distribution, real-world scans. To overcome this limitation, we introduce a zero-shot completion framework, termed GenPC, designed to reconstruct high-quality real-world scans by leveraging explicit 3D generative priors. Our key insight is that recent feed-forward 3D generative models, trained on extensive internet-scale data, have demonstrated the ability to perform 3D generation from single-view images in a zero-shot setting. To harness this for completion, we first develop a Depth Prompting module that links partial point clouds with image-to-3D generative models by leveraging depth images as a stepping stone. To retain the original partial structure in the final results, we design the Geometric Preserving Fusion module that aligns the generated shape with input by adaptively adjusting its pose and scale. Extensive experiments on widely used benchmarks validate the superiority and generalizability of our approach, bringing us a step closer to robust real-world scan completion.

Title: Image Referenced Sketch Colorization Based on Animation Creation Workflow

Authors: Dingkun Yan, Xinrui Wang, Zhuoru Li, Suguru Saito, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2502.19937
Pdf URL: https://arxiv.org/pdf/2502.19937
Copy Paste: [[2502.19937]] Image Referenced Sketch Colorization Based on Animation Creation Workflow(https://arxiv.org/abs/2502.19937)
Keywords: diffusion
Abstract: Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still meet problems in that text-guided methods fail to provide accurate color and style reference, hint-guided methods still involve manual operation, and image-referenced methods are prone to cause artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as the spatial guidance and an RGB image as the color reference, and separately extracts foreground and background from the reference image with spatial masks. Particularly, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules. They are trained separately with foreground and background regions to control the corresponding embeddings for keys and values in cross-attention. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating the spatial artifacts. During inference, we design switchable inference modes for diverse use scenarios by changing modules activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-qualigy artifact-free results with geometric mismatched references. Ablation studies further confirm the effectiveness of each component. Codes are available at this https URL tellurion-kanata/colorizeDiffusion.

Title: SeisMoLLM: Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language Model

Authors: Xinghao Wang, Feng Liu, Rui Su, Zhihui Wang, Lei Bai, Wanli Ouyang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.19960
Pdf URL: https://arxiv.org/pdf/2502.19960
Copy Paste: [[2502.19960]] SeisMoLLM: Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language Model(https://arxiv.org/abs/2502.19960)
Keywords: foundation model
Abstract: Recent advances in deep learning have revolutionized seismic monitoring, yet developing a foundation model that performs well across multiple complex tasks remains challenging, particularly when dealing with degraded signals or data scarcity. This work presents SeisMoLLM, the first foundation model that utilizes cross-modal transfer for seismic monitoring, to unleash the power of large-scale pre-training from a large language model without requiring direct pre-training on seismic datasets. Through elaborate waveform tokenization and fine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art performance on the DiTing and STEAD datasets across five critical tasks: back-azimuth estimation, epicentral distance estimation, magnitude estimation, phase picking, and first-motion polarity classification. It attains 36 best results out of 43 task metrics and 12 top scores out of 16 few-shot generalization metrics, with many relative improvements ranging from 10% to 50%. In addition to its superior performance, SeisMoLLM maintains efficiency comparable to or even better than lightweight models in both training and inference. These findings establish SeisMoLLM as a promising foundation model for practical seismic monitoring and highlight cross-modal transfer as an exciting new direction for earthquake studies, showcasing the potential of advanced deep learning techniques to propel seismology research forward.

Title: A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation

Authors: Tianyang Qi, Shibo Chen, Jun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.20068
Pdf URL: https://arxiv.org/pdf/2502.20068
Copy Paste: [[2502.20068]] A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation(https://arxiv.org/abs/2502.20068)
Keywords: generative
Abstract: With the widespread adoption of electric vehicles (EVs), navigating for EV drivers to select a cost-effective charging station has become an important yet challenging issue due to dynamic traffic conditions, fluctuating electricity prices, and potential competition from other EVs. The state-of-the-art deep reinforcement learning (DRL) algorithms for solving this task still require global information about all EVs at the execution stage, which not only increases communication costs but also raises privacy issues among EV drivers. To overcome these drawbacks, we introduce a novel generative model-enhanced multi-agent DRL algorithm that utilizes only the EV's local information while achieving performance comparable to these state-of-the-art algorithms. Specifically, the policy network is implemented on the EV side, and a Conditional Variational Autoencoder-Long Short Term Memory (CVAE-LSTM)-based recommendation model is developed to provide recommendation information. Furthermore, a novel future charging competition encoder is designed to effectively compress global information, enhancing training performance. The multi-gradient descent algorithm (MGDA) is also utilized to adaptively balance the weight between the two parts of the training objective, resulting in a more stable training process. Simulations are conducted based on a practical area in Xián, China. Experimental results show that our proposed algorithm, which relies on local information, outperforms existing local information-based methods and achieves less than 8\% performance loss compared to global information-based methods.

Title: VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

Authors: Ziang Guo, Konstantin Gubernatorov, Selamawit Asfaw, Zakhar Yagudin, Dzmitry Tsetserukou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2502.20108
Pdf URL: https://arxiv.org/pdf/2502.20108
Copy Paste: [[2502.20108]] VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers(https://arxiv.org/abs/2502.20108)
Keywords: diffusion
Abstract: In autonomous driving, dynamic environment and corner cases pose significant challenges to the robustness of ego vehicle's decision-making. To address these challenges, commencing with the representation of state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging the advancement of the state understanding of Visual Language Model (VLM), incorporating with diffusion Transformer-based action generation, our VDT-Auto parses the environment geometrically and contextually for the conditioning of the diffusion process. Geometrically, we use a bird's-eye view (BEV) encoder to extract feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During our diffusion process, the added noise for the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. Our VDT-Auto achieved 0.52m on average L2 errors and 21% on average collision rate in the nuScenes open-loop planning evaluation. Moreover, the real-world demonstration exhibited prominent generalizability of our VDT-Auto. The code and dataset will be released after acceptance.

Title: Self-Training Elicits Concise Reasoning in Large Language Models

Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20122
Pdf URL: https://arxiv.org/pdf/2502.20122
Copy Paste: [[2502.20122]] Self-Training Elicits Concise Reasoning in Large Language Models(https://arxiv.org/abs/2502.20122)
Keywords: in-context
Abstract: Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at this https URL

Title: FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Authors: Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.20126
Pdf URL: https://arxiv.org/pdf/2502.20126
Copy Paste: [[2502.20126]] FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute(https://arxiv.org/abs/2502.20126)
Keywords: diffusion
Abstract: Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into \emph{flexible} ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single \emph{flexible} model can generate images without any drop in quality, while reducing the required FLOPs by more than $40$\% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to $75$\% less compute without compromising performance.

Title: Your contrastive learning problem is secretly a distribution alignment problem

Authors: Zihao Chen, Chi-Heng Lin, Ran Liu, Jingyun Xiao, Eva L Dyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.20141
Pdf URL: https://arxiv.org/pdf/2502.20141
Copy Paste: [[2502.20141]] Your contrastive learning problem is secretly a distribution alignment problem(https://arxiv.org/abs/2502.20141)
Keywords: self-supervised
Abstract: Despite the success of contrastive learning (CL) in vision and language, its theoretical foundations and mechanisms for building representations remain poorly understood. In this work, we build connections between noise contrastive estimation losses widely used in CL and distribution alignment with entropic optimal transport (OT). This connection allows us to develop a family of different losses and multistep iterative variants for existing CL methods. Intuitively, by using more information from the distribution of latents, our approach allows a more distribution-aware manipulation of the relationships within augmented sample sets. We provide theoretical insights and experimental evidence demonstrating the benefits of our approach for {\em generalized contrastive alignment}. Through this framework, it is possible to leverage tools in OT to build unbalanced losses to handle noisy views and customize the representation space by changing the constraints on alignment. By reframing contrastive learning as an alignment problem and leveraging existing optimization tools for OT, our work provides new insights and connections between different self-supervised learning models in addition to new tools that can be more easily adapted to incorporate domain knowledge into learning.

Title: Adaptive H&E-IHC information fusion staining framework based on feature extra

Authors: Yifan Jia, Xingda Yu, Zhengyang Ji, Songning Lai, Yutao Yue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20156
Pdf URL: https://arxiv.org/pdf/2502.20156
Copy Paste: [[2502.20156]] Adaptive H&E-IHC information fusion staining framework based on feature extra(https://arxiv.org/abs/2502.20156)
Keywords: generative
Abstract: Immunohistochemistry (IHC) staining plays a significant role in the evaluation of diseases such as breast cancer. The H&E-to-IHC transformation based on generative models provides a simple and cost-effective method for obtaining IHC images. Although previous models can perform digital coloring well, they still suffer from (i) coloring only through the pixel features that are not prominent in HE, which is easy to cause information loss in the coloring process; (ii) The lack of pixel-perfect H&E-IHC groundtruth pairs poses a challenge to the classical L1 this http URL address the above challenges, we propose an adaptive information enhanced coloring framework based on feature extractors. We first propose the VMFE module to effectively extract the color information features using multi-scale feature extraction and wavelet transform convolution, while combining the shared decoder for feature fusion. The high-performance dual feature extractor of H&E-IHC is trained by contrastive learning, which can effectively perform feature alignment of HE-IHC in high latitude space. At the same time, the trained feature encoder is used to enhance the features and adaptively adjust the loss in the HE section staining process to solve the problems related to unclear and asymmetric information. We have tested on different datasets and achieved excellent this http URL code is available at this https URL

Title: Learning to Generalize without Bias for Open-Vocabulary Action Recognition

Authors: Yating Yu, Congqi Cao, Yifan Zhang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20158
Pdf URL: https://arxiv.org/pdf/2502.20158
Copy Paste: [[2502.20158]] Learning to Generalize without Bias for Open-Vocabulary Action Recognition(https://arxiv.org/abs/2502.20158)
Keywords: in-context
Abstract: Leveraging the effective visual-text alignment and static generalizability from CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalizing and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, the free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.

Title: Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

Authors: Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2502.20172
Pdf URL: https://arxiv.org/pdf/2502.20172
Copy Paste: [[2502.20172]] Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think(https://arxiv.org/abs/2502.20172)
Keywords: diffusion
Abstract: The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.

Title: Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Authors: Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, Shunsuke Saito
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20220
Pdf URL: https://arxiv.org/pdf/2502.20220
Copy Paste: [[2502.20220]] Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars(https://arxiv.org/abs/2502.20220)
Keywords: foundation model
Abstract: Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars from images of different sources, smartphone captures, single images, and even out-of-domain inputs like antique busts. Project website: this https URL

Title: Attention Distillation: A Unified Approach to Visual Characteristics Transfer

Authors: Yang Zhou, Xu Gao, Zichong Chen, Hui Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20235
Pdf URL: https://arxiv.org/pdf/2502.20235
Copy Paste: [[2502.20235]] Attention Distillation: A Unified Approach to Visual Characteristics Transfer(https://arxiv.org/abs/2502.20235)
Keywords: diffusion, generative
Abstract: Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis. Code is available at this https URL.

Title: From Retrieval to Generation: Comparing Different Approaches

Authors: Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20245
Pdf URL: https://arxiv.org/pdf/2502.20245
Copy Paste: [[2502.20245]] From Retrieval to Generation: Comparing Different Approaches(https://arxiv.org/abs/2502.20245)
Keywords: generative
Abstract: Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR), efficiently retrieve from large corpora but often lack semantic depth. Generative models like GPT-4-o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17\% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity compared to generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions where each approach excels, we aim to facilitate future optimizations in retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.

Title: Do computer vision foundation models learn the low-level characteristics of the human visual system?

Authors: Yancheng Cai, Fei Yin, Dounia Hammou, Rafal Mantiuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20256
Pdf URL: https://arxiv.org/pdf/2502.20256
Copy Paste: [[2502.20256]] Do computer vision foundation models learn the low-level characteristics of the human visual system?(https://arxiv.org/abs/2502.20256)
Keywords: self-supervised, foundation model, generative
Abstract: Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP), share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show smaller sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks start to align with low-level human vision, with DINOv2 showing the closest resemblance.

Title: Vector-Quantized Vision Foundation Models for Object-Centric Learning

Authors: Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20263
Pdf URL: https://arxiv.org/pdf/2502.20263
Copy Paste: [[2502.20263]] Vector-Quantized Vision Foundation Models for Object-Centric Learning(https://arxiv.org/abs/2502.20263)
Keywords: foundation model
Abstract: Decomposing visual scenes into objects, as humans do, facilitates modeling object relations and dynamics. Object-Centric Learning (OCL) achieves this by aggregating image or video feature maps into object-level feature vectors, known as \textit{slots}. OCL's self-supervision via reconstructing the input from slots struggles with complex textures, thus many methods employ Vision Foundation Models (VFMs) to extract feature maps with better objectness. However, using VFMs merely as feature extractors does not fully unlock their potential. We propose Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO), where VFM features are extracted to facilitate object-level information aggregation and further quantized to strengthen supervision in reconstruction. Our VVO unifies OCL representatives into a concise architecture. Experiments demonstrate that VVO not only outperforms mainstream methods on object discovery tasks but also benefits downstream tasks like visual prediction and reasoning. The source code is available in the supplement.

Title: Explainable, Multi-modal Wound Infection Classification from Images Augmented with Generated Captions

Authors: Palawat Busaranuvong, Emmanuel Agu, Reza Saadati Fard, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20277
Pdf URL: https://arxiv.org/pdf/2502.20277
Copy Paste: [[2502.20277]] Explainable, Multi-modal Wound Infection Classification from Images Augmented with Generated Captions(https://arxiv.org/abs/2502.20277)
Keywords: diffusion
Abstract: Infections in Diabetic Foot Ulcers (DFUs) can cause severe complications, including tissue death and limb amputation, highlighting the need for accurate, timely diagnosis. Previous machine learning methods have focused on identifying infections by analyzing wound images alone, without utilizing additional metadata such as medical notes. In this study, we aim to improve infection detection by introducing Synthetic Caption Augmented Retrieval for Wound Infection Detection (SCARWID), a novel deep learning framework that leverages synthetic textual descriptions to augment DFU images. SCARWID consists of two components: (1) Wound-BLIP, a Vision-Language Model (VLM) fine-tuned on GPT-4o-generated descriptions to synthesize consistent captions from images; and (2) an Image-Text Fusion module that uses cross-attention to extract cross-modal embeddings from an image and its corresponding Wound-BLIP caption. Infection status is determined by retrieving the top-k similar items from a labeled support set. To enhance the diversity of training data, we utilized a latent diffusion model to generate additional wound images. As a result, SCARWID outperformed state-of-the-art models, achieving average sensitivity, specificity, and accuracy of 0.85, 0.78, and 0.81, respectively, for wound infection classification. Displaying the generated captions alongside the wound images and infection detection results enhances interpretability and trust, enabling nurses to align SCARWID outputs with their medical knowledge. This is particularly valuable when wound notes are unavailable or when assisting novice nurses who may find it difficult to identify visual attributes of wound infection.

Title: Mobius: Text to Seamless Looping Video Generation via Latent Shift

Authors: Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20307
Pdf URL: https://arxiv.org/pdf/2502.20307
Copy Paste: [[2502.20307]] Mobius: Text to Seamless Looping Video Generation via Latent Shift(https://arxiv.org/abs/2502.20307)
Keywords: diffusion
Abstract: We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations, thereby creating new visual materials for the multi-media presentation. Our method repurposes the pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that the temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model's context. Unlike previous cinemagraphs, the proposed method does not require an image as appearance, which will restrict the motions of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.

Title: FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

Authors: Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20313
Pdf URL: https://arxiv.org/pdf/2502.20313
Copy Paste: [[2502.20313]] FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction(https://arxiv.org/abs/2502.20313)
Keywords: diffusion
Abstract: This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images ($\leq$ 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256$\times$256 benchmark. Moreover, when zero-shot transfer the image generation process with 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512$\times$512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512$\times$512 resolution.

Title: Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds

Authors: Mohamed Abdelsamad, Michael Ulrich, Claudius Gläser, Abhinav Valada
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2502.20316
Pdf URL: https://arxiv.org/pdf/2502.20316
Copy Paste: [[2502.20316]] Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds(https://arxiv.org/abs/2502.20316)
Keywords: self-supervised, generative
Abstract: Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond. However, point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, thereby limiting the SSL pre-training to only 2D bird's eye view encoders in practice. In this work, we propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAEs are extremely flexible and can be directly employed for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results demonstrate that NOMAE sets the new state-of-the-art on multiple benchmarks for multiple point cloud perception tasks.

Title: ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

Authors: Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20323
Pdf URL: https://arxiv.org/pdf/2502.20323
Copy Paste: [[2502.20323]] ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model(https://arxiv.org/abs/2502.20323)
Keywords: diffusion
Abstract: Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles using sample motion sequences, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.

Title: Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

Authors: Ryan C. Barron, Maksim E. Eren, Olga M. Serafimova, Cynthia Matuszek, Boian S. Alexandrov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20364
Pdf URL: https://arxiv.org/pdf/2502.20364
Copy Paste: [[2502.20364]] Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization(https://arxiv.org/abs/2502.20364)
Keywords: generative
Abstract: Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain here comprises complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, VS, and KG, constructed via Non-Negative Matrix Factorization (NMF), to enhance legal information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends-challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This framework supports legal document clustering, summarization, and cross-referencing, for scalable, interpretable, and accurate retrieval for semi-structured data while advancing computational law and AI.

Title: Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

Authors: Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20370
Pdf URL: https://arxiv.org/pdf/2502.20370
Copy Paste: [[2502.20370]] Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation(https://arxiv.org/abs/2502.20370)
Keywords: diffusion
Abstract: This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one's motions based on the counterpart's complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart's motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments.

Title: Constrained Generative Modeling with Manually Bridged Diffusion Models

Authors: Saeid Naderiparizi, Xiaoxuan Liang, Berend Zwartsenberg, Frank Wood
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.20371
Pdf URL: https://arxiv.org/pdf/2502.20371
Copy Paste: [[2502.20371]] Constrained Generative Modeling with Manually Bridged Diffusion Models(https://arxiv.org/abs/2502.20371)
Keywords: diffusion, generative
Abstract: In this paper we describe a novel framework for diffusion-based generative modeling on constrained spaces. In particular, we introduce manual bridges, a framework that expands the kinds of constraints that can be practically used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects such multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanism in constrained generative modeling tasks, highlighting a particular high-value application in modeling trajectory initializations for path planning and control in autonomous vehicles.

Title: Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20388
Pdf URL: https://arxiv.org/pdf/2502.20388
Copy Paste: [[2502.20388]] Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation(https://arxiv.org/abs/2502.20388)
Keywords: generative
Abstract: Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as \textbf{continuous entity regression}, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (\eg, DINOv2) or advanced guidance interval sampling.

Title: InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Authors: Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2502.20390
Pdf URL: https://arxiv.org/pdf/2502.20390
Copy Paste: [[2502.20390]] InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions(https://arxiv.org/abs/2502.20390)
Keywords: generative
Abstract: Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.